Face Recognition using Optimal Representation Ensemble



Hanxi Li^{1,2}, Chunhua Shen^{3}, Yongsheng Gao^{1,2}
^1 NICTA* Queensland Research Laboratory, QLD, Australia
^2 Griffith University, QLD, Australia
^3 University of Adelaide, SA, Australia

4.10-2011 



Abstract 

Recently, face recognizers based on linear representations have been shown
to deliver state-of-the-art performance. These approaches assume that the faces
belonging to one individual reside in a linear subspace with a relatively low
dimensionality. In real-world applications, however, face images usually suffer
from expressions, disguises and random occlusions. The problematic facial parts
undermine the validity of the subspace assumption and thus the recognition
performance deteriorates significantly. In this work, we address the problem in a
learning-inference-mixed fashion. By observing that the linear-subspace
assumption is more reliable on certain face patches than on the holistic face, some
Bayesian Patch Representations (BPRs) are randomly generated and interpreted
according to Bayes' theory. We then train an ensemble model over the
patch-representations by minimizing the empirical risk w.r.t. the "leave-one-out
margins". The obtained model is termed Optimal Representation Ensemble (ORE),
since it guarantees the optimality from the perspective of Empirical Risk
Minimization. To handle the unknown patterns in test faces, a robust version of BPR
is proposed by taking the non-face category into consideration. Equipped with the
Robust-BPRs, the inference ability of ORE is increased dramatically and several
record-breaking accuracies (99.9% on Yale-B and 99.5% on AR) and desirable
efficiencies (below 20 ms per face in Matlab) are achieved. It also outperforms other
modular heuristics on faces with random occlusions, extreme expressions and
disguises. Furthermore, to accommodate immense BPR sets, a boosting-like
algorithm is also derived. The boosted model, a.k.a. Boosted-ORE, obtains similar
performance to its prototype. Besides the empirical superiorities, two desirable
features of the proposed methods, namely the training-determined model-selection
and the data-weight-free boosting procedure, are also theoretically verified. They
reduce the training complexity immensely, while keeping the generalization capacity
unchanged.

*NICTA is funded by the Australian Government as represented by the Department of Broadband, Com- 
munications and the Digital Economy and the Australian Research Council through the ICT Center of Ex- 
cellence program. 
1 Introduction



Face Recognition is a long-standing problem in computer vision. In the past decade, 
much effort has been devoted to the Linear Representation (LR) based algorithms such 
as Nearest Feature Line (NFL) [1], Nearest Feature Subspace (NFS) [2], Sparse Rep- 
resentation Classification (SRC) [3] and the most recently proposed Linear Regres- 
sion Classification (LRC) [4]. Compared with traditional face recognition approaches, 
higher accuracies have been reported. The underlying assumption for the LR-classifiers 
is that the faces of one individual reside in a low-dimensional linear manifold. This 
assumption, however, is only valid when the cropped faces are considered as rigid 
Lambertian surfaces and without any occlusion [5, 6]. In practice, the linear-subspace
model is sometimes too rudimentary to handle expressions, disguises and random
occlusions, which usually occur in local regions: expressions influence the mouth and
eyes more than the nose, and scarves typically affect the lower half of the face.
The problematic face parts are not suitable for performing the linear representation and 
thus reduce the recognition accuracy. On the other hand, there should be some face 
parts which are less problematic, i.e. more reliable. But, how can we evaluate the re- 
liability of one face part? Given the reliabilities of all the parts, how do we make the 
final decision? 

Several heuristic methods were introduced to address the problem. In particular, the 
modular approach is used in [3] and [4] for eliminating the adverse impact of continu- 
ous occlusions. Significant improvement in accuracy was observed from the partition- 
and-vote [3] or the partition-and-compete [4] strategy. The drawbacks of these heuris- 
tics are also clear. First, one must roughly know a priori the shape and location of the 
occlusion otherwise the performance will still deteriorate. It is desirable to design more 
flexible "models" to handle occlusions with arbitrary spatial features. Furthermore, the 
existing heuristics discard much useful information, like the representation residuals in 
[3] or the classification results of the unselected blocks in [4]. Higher efficiencies are 
expected when all the information is simultaneously analyzed. Thirdly, there is great 
potential to increase the performance by employing a sophisticated fusion method, 
rather than the primitive rules in [3] and [4]. Finally, most existing methods neglect 
the fact that the LR-method can also be used to distinguish human faces from non-face 
images, or partly-non-face images. By harnessing this power, one could achieve higher 
robustness to occlusions and noise.

In this paper, we propose a learning-inference-mixed framework to learn and
recognize faces. The novel framework generates, interprets and aggregates the partial
representations more elegantly. First of all, LRs are performed on randomly generated
face patches. Secondly, in a novel manner, we interpret every patch representation as a
probability vector, with each element corresponding to a certain individual. The
interpretation is obtained by applying Bayes' theorem to a basic distribution assumption,
and is thus referred to as Bayesian Patch Representation (BPR). We then learn a linear
combination of the obtained BPRs to gain much higher classification ability. The
combination coefficients, i.e. the weights associated with different BPRs, are obtained by
minimizing the exponential loss w.r.t. sample margins [7]. In this way, most given
face-related patterns are learned by assigning different "importances" to various patches.
The learned model is termed Optimal Representation Ensemble (ORE) since it
guarantees the optimality from the perspective of Empirical Risk Minimization. To cope with
unknown patterns in test faces, a variation of BPR, namely Robust-BPR, is derived
by taking account of the Generic-Face-Confidence (GFC). The inference power of the ORE
model is improved dramatically by employing the Robust-BPRs.

The BPRs are, essentially, instance-based. One cannot simply copy an off-the-shelf
ensemble learning method to combine them. To accommodate the instance-based
predictors and optimally exploit the given information, we propose the leave-one-out
margin to replace the conventional margin concept. The leave-one-out margin also
makes the ORE-Learning procedure extremely resistant to overfitting, as we verify
theoretically. One can therefore choose the model parameter merely depending on
the training errors. This merit of ORE-Learning leads to a remarkable drop in the
validation complexity. In addition, to tailor the proposed method to immense BPR sets,
a boosting-like algorithm is designed to obtain the ORE in an iterative fashion. The
boosted model, Boosted-ORE, can be learned very efficiently, as we prove that the
training procedure is unrelated to data weights. From a higher point of view, we offer
an elegant and efficient framework for training a discriminative ensemble of
instance-based classifiers.

A few works have used ensemble learning methods for face recognition [8, 9, 10,
11, 12]. Nonetheless, those methods only combine model-based, primitive
classifiers, e.g. Linear Discriminant Analysis (LDA) or Principal Component Analysis
(PCA), which are sensitive to illumination, easy to over-fit and neglect the non-face
category. In contrast, our Bayesian-rule-based BPRs overcome all these drawbacks.
Furthermore, the proposed ensemble methods globally minimize an explicit loss
function w.r.t. margins. This is a more principled way, in comparison to the simple
voting strategies [8, 9, 12] or the heuristically customized boosting schemes [10, 11].

The experiments demonstrate the advantage of the proposed algorithms over
conventional LR-methods. In particular, ORE achieves some record-breaking accuracies (99.9%
for the Yale-B dataset and 99.5% for the AR dataset) on faces with extreme illumination
changes, expressions and disguises. Boosted-ORE also shows similar recognition
capability. Equipped with the Generic-Face-Confidence (GFC), Robust-ORE outperforms other
modular heuristics under all the circumstances. Moreover, the ORE-model also shows the
highest efficiency (below 20 ms per face with Matlab and one CPU core) among all the compared
LR-methods.

The rest of this paper is organized as follows. In Section 2, we briefly introduce 
the family of LR-classifiers and the modular heuristics. BPR and Robust-BPR are pro- 
posed in the following section. The learning algorithm for obtaining ORE is derived in 
Section 4 where we also prove the validity of the training-determined model-selection.
The derivations of the boosting-like variation, a.k.a. Boosted-ORE, and its desirable 
feature in terms of ultrafast training are given in Section 5. Section 6 introduces the 
learning-inference-mixed strategy of the ORE algorithm. The experiment and results 
are shown in Section 7 while the conclusion and future topics can be found in the final 
section. 
2 Background



2.1 The family of LR-classifiers 

For a face recognition problem, one is usually given $N$ vectorized face images
$X \in \mathbb{R}^{D \times N}$ belonging to $K$ different individuals, where $D$ is the
dimensionality of faces and $N$ is the face number. Let us suppose their labels are
$l = \{l_1, l_2, \dots, l_N\}$, $l_i \in \{1, 2, \dots, K\}\ \forall i$. When a probe face
$y \in \mathbb{R}^D$ is provided, we need to identify it as one individual existing in the
training set, i.e. $\gamma_y = H(y) \in \{1, 2, \dots, K\}$, where $H(\cdot)$ is the
face recognizer that generates the predicted label $\gamma$. Without loss of generality, in this
paper, we assume all the classes share the same sample number $M = N/K$. For the $k$th
face category, let $x_i^k \in \mathbb{R}^D$ denote the $i$th face image and
$X_k = [x_1^k, x_2^k, \dots, x_M^k] \in \mathbb{R}^{D \times M}$ indicate the image collection of the $k$th class^1.

Nearest Neighbor (NN) can be thought of as the most primitive LR-method. It 
uses only one training face, a.k.a. the nearest neighbor, to represent the test face. 
However, without a powerful feature extraction approach, NN usually performs very 
poorly. Therefore, more advanced methods like NFL [1], NFS [2], SRC [3] and LRC 
[4] are proposed. Most of their formulations ([1, 2, 4]) can be unified. For class
$k \in \{1, 2, \dots, K\}$, a typical LR-classifier first solves the following problem to get
the representation coefficients $\beta_k$, i.e.

$$\min_{\beta_k} \|y - \tilde{X}_k \beta_k\|_2 \quad \forall k \in \{1, 2, \dots, K\}, \tag{1}$$

where $\|\cdot\|_p$ stands for the $\ell_p$ norm and $\tilde{X}_k$ is a subset of $X_k$, selected under certain
rules. The above problem, also known as the Least Squares Problem, has a closed-form
solution given by

$$\beta_k = (\tilde{X}_k^\top \tilde{X}_k)^{-1} \tilde{X}_k^\top y. \tag{2}$$

The identity of test face $y$ is then retrieved as

$$\gamma_y = \arg\min_{k \in \{1, \dots, K\}} r_k, \tag{3}$$

where $r_k$ is the reconstruction residual associated with class $k$, i.e.

$$r_k = \|y - \tilde{X}_k \beta_k\|_2. \tag{4}$$

Different rules for selecting $\tilde{X}_k$ actually specify different members of the
LR-family. NN merely uses one nearest neighbor from $X_k$ as the representation basis; NFL
exhaustively searches for two faces which form the nearest line to the test face; NFS conducts
a similar search for the nearest subspace with a specific dimensionality; finally, at the
other end of the spectrum, LRC directly employs the whole $X_k$ to represent $y$. Note
that although the solution of problem (1) is closed-form, most LR-methods require a
brute-force search to obtain $\tilde{X}_k$. The only exception is LRC, where $\tilde{X}_k = X_k$;
thus LRC is much faster than the other members.
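
As a rough illustration of Equations (1) to (4) for the LRC case, where the whole class gallery $X_k$ serves as the basis, the following NumPy sketch classifies a probe by its smallest reconstruction residual. The function name `lrc_predict`, the toy dimensions and the random galleries are assumptions made for this example only; the least-squares problem is solved with `np.linalg.lstsq` rather than by forming the inverse in Equation (2).

```python
import numpy as np

def lrc_predict(galleries, y):
    """Classify probe y by the smallest class-wise reconstruction residual.

    galleries: list of K arrays of shape (D, M), one class per array (faces as columns).
    y: probe vector of shape (D,).
    Returns (predicted class index, array of K residuals).
    """
    residuals = []
    for X_k in galleries:
        # Least-squares coefficients, the solution of problem (1) for class k.
        beta_k, *_ = np.linalg.lstsq(X_k, y, rcond=None)
        # Reconstruction residual, Equation (4).
        residuals.append(np.linalg.norm(y - X_k @ beta_k))
    residuals = np.asarray(residuals)
    # Decision rule, Equation (3).
    return int(np.argmin(residuals)), residuals

# Toy data (hypothetical): 3 classes, 5 faces each, 50 "pixels".
rng = np.random.default_rng(0)
D, M, K = 50, 5, 3
galleries = [rng.standard_normal((D, M)) for _ in range(K)]
# Probe lying close to the span of class 1.
y = galleries[1] @ rng.standard_normal(M) + 0.01 * rng.standard_normal(D)
label, res = lrc_predict(galleries, y)
```

The other family members differ only in how the columns passed to the least-squares step are selected.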

^1 For simplicity, we slightly abuse the notation: the symbol of a matrix is also used to represent the set
comprised of all the columns of this matrix.
The SRC algorithm, on the other hand, solves a second-order-cone problem over
the entire training set $X$. The optimization problem reads

$$\min_{\beta} \|\beta\|_1 \quad \text{s.t.} \quad \|y - X\beta\|_2 \le \varepsilon, \tag{5}$$

and the class-wise residual is computed as

$$r_k = \|y - X\,\delta_k(\beta)\|_2, \tag{6}$$

where function $\delta_k(\cdot)$ sets all the coefficients of $\beta$ to $0$ except those corresponding to
the $k$th class [3]. The identifying procedure of SRC is the same^2 as (3). By treating the
occlusion as a "noisy" part, Wright et al. [3] also proposed a robust version of SRC,
which conducts the optimization as follows:

$$\min_{u} \|u\|_1 \quad \text{s.t.} \quad \|y - [X, I]\,u\|_2 \le \varepsilon, \tag{7}$$

where $I$ is an identity matrix, $u = [\beta, e]^\top$ and $e$ is the vector of representation coefficients
corresponding to the non-face part. SRC is very slow [13] due to the second-order-cone
programming, and its underlying assumption is also doubtful [14].

Obviously, all the LR-methods are generative rather than discriminative. Their
main goal is to best reconstruct the test face $y$, while the subsequent classification
procedure seems a "byproduct". Nonetheless, they still achieve impressive performance
because the underlying linear-subspace theory [5, 6] remains approximately valid no matter
how the illumination changes. Unfortunately, for faces with extreme expressions,
disguises or random contaminations, the theory no longer holds and poor
recognition accuracies are usually observed.

2.2 Two modular heuristics for robust recognition

To ease the difficulties, some modular methods have been proposed. In particular, Wright et
al. [3] partition the face image into several (usually 4 to 8) blocks and perform the
robust SRC, as illustrated in (7), on each of them. The final identity of the test face is
determined via a majority vote over all the blocks. We term this algorithm
Block-SRC in this paper. The Distance-based Evidence Fusion (DEF) approach [4] modifies
LRC via a similar block-wise strategy, while the predicted label is given by a competition
procedure. Without loss of generality, their methods can be summarized as:

$$\gamma_y = F(r_1, r_2, \dots, r_T), \tag{8}$$

where $r_t = [r_{t,1}, r_{t,2}, \dots, r_{t,K}]^\top$ is the collection of all the $K$ residuals for the $t$th
block, and $F(\cdot)$ refers to the fusion method, which counts the votes in [3] and performs
the competition in [4]. In other words, $F(\cdot)$ reflects how we combine the
block-representations' outputs.

The modular approaches did increase the accuracy. Their success implies that the
linear-subspace assumption is more reliable on certain face parts than on the holistic
face. Nonetheless, their drawbacks, as described in the introduction, are also
obvious. In the following sections, we build a more elegant framework to generate,
interpret and aggregate the partial representations.

^2 Note that for SRC, $\tilde{X}_k = X_k$.
3 Bayesian Patch Representation

3.1 Random face patches 

3.1.1 What are random face patches 

A random face patch is a contiguous part of the face image, with an arbitrary shape
and size. Patch-based methods have achieved great success in face
detection [15]. Compared with the Haar features used in face detection, linear representations
are much more sophisticated, so there is no need to generate the patches exhaustively.
In this paper, we only employ 500 small patches randomly distributed over the face
image. Those patches are already sufficient to sample all the reliable face parts.
Different weights are assigned to these patches to indicate their importance for a specific
recognition task. We expect that a certain combination of these patches could yield a
classification capacity similar to the direct use of all the reliable regions.

Figure 1(a) gives an example of the weighted patches. 500 random face patches
are generated with different shapes (here only rectangles). The higher the weight
assigned to a patch, the redder and wider it is drawn. The weights are obtained by running
the proposed ORE-Learning algorithm on the AR [16] dataset. Note that most patches are
purely blue, which implies that their weights are too small to influence the classification.
We simply ignore those patches in practice.

3.1.2 Why random face patches 




[Figure 1 panels: (a) weighted patches, (b) pixel-energy map, (c) focused face]

Figure 1: The demonstration of random face patches. (a): 500 random face patches with different weights;
the weight is represented by the color and width of edges. (b): the corresponding pixel-energy map; the
energy of one pixel is defined as the average weight of all the overlapping patches. (c): the simulated focusing
behavior. Only a small part of the face is emphasized (focused) while the others are ignored (blurred). The
weights are obtained by using the proposed ORE-Learning algorithm on the AR [16] dataset.

Compared with the deterministic blocks, random patches have the following two 
advantages. 

• More flexible. The reliable region of a given face could be of arbitrary shape
and location. Deterministic blocks are therefore too rudimentary to represent
it. The random patch approach, on the other hand, can approximate a region of
interest of any shape. Figure 1(b) illustrates a pixel-energy map corresponding
to the patches shown in Figure 1(a). A pixel's energy is the mean weight of
all the patches covering this pixel. An irregularly shaped but reasonable region,
which includes two eyes and certain parts of the forehead, is emphasized. From a
bionic perspective, it is promising to aggregate the random patches for simulating
the focusing behavior of human beings. Figure 1(c) illustrates the simulated
focusing behavior: the face is blurred according to the pixel weights, and only the
focused facial part, a.k.a. the emphasized region, remains clear.

• More efficient. According to Figure 1, only a limited number (always in the
order of 10^ in this work) of patches are taken into consideration. As shown
empirically in the experiments, the complexity of performing the LR method on
several small patches is usually lower than that for a few large blocks.
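
The pixel-energy map described above (the mean weight of all patches covering a pixel) can be sketched in a few lines of plain Python. The image size of 32 x 32, the patch-size range and the toy weights below are assumptions chosen for illustration; the text only specifies that 500 rectangular patches are drawn at random.

```python
import random

def random_patches(height, width, count, min_size, max_size, seed=0):
    """Draw `count` axis-aligned rectangles (top, left, h, w) inside the image."""
    rng = random.Random(seed)
    patches = []
    for _ in range(count):
        h = rng.randint(min_size, min(max_size, height))
        w = rng.randint(min_size, min(max_size, width))
        top = rng.randint(0, height - h)
        left = rng.randint(0, width - w)
        patches.append((top, left, h, w))
    return patches

def pixel_energy(height, width, patches, weights):
    """Energy of a pixel = mean weight of all patches covering it (0 if uncovered)."""
    total = [[0.0] * width for _ in range(height)]
    cover = [[0] * width for _ in range(height)]
    for (top, left, h, w), weight in zip(patches, weights):
        for r in range(top, top + h):
            for c in range(left, left + w):
                total[r][c] += weight
                cover[r][c] += 1
    return [[total[r][c] / cover[r][c] if cover[r][c] else 0.0
             for c in range(width)] for r in range(height)]

# 500 random rectangles on a hypothetical 32 x 32 face image.
patches = random_patches(32, 32, count=500, min_size=4, max_size=12)
# Toy sparse weights standing in for learned ones: most patches get weight 0.
weights = [1.0 if i % 50 == 0 else 0.0 for i in range(len(patches))]
energy = pixel_energy(32, 32, patches, weights)
```

With learned weights in place of the toy ones, thresholding or blurring by `energy` would give a map in the spirit of Figure 1(b) and 1(c).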

3.2 Bayesian Patch Representation 

Given that the linear-subspace assumption is more reliable on certain face patches, it
is intuitive to perform the LR-method on each patch. In principle, we could employ
any member of the LR-family to perform the linear representation on the patches.
According to the theoretical analysis [5, 6], however, there seems to be no need to
select a specific subset from $X_k$. We thus employ the whole $X_k$ to form the
representation basis, as done in [4, 5]. In particular, for class $k$ and patch $t$, we denote
the patch set as $X_k^t = [x_1^t, x_2^t, \dots, x_M^t]$, a matrix with $M$ columns, each obtained by
vectorizing an image patch. The representation coefficients $\beta_k^t$ for the $k$th class and
$t$th patch are then given by

$$\beta_k^t = (X_k^{t\top} X_k^t)^{-1} X_k^{t\top} y_t. \tag{9}$$

Then the residual $r_{t,k}$ can be obtained as

$$r_{t,k} = \|y_t - X_k^t \beta_k^t\|_2, \tag{10}$$

where $y_t$ is the test image cropped according to the patch location. In this paper, all
the patches are normalized so that their $\ell_2$ norms are equal to 1. As a result,
$r_{t,k} \in [0, 1],\ \forall t, k$.

Ordinary LR-methods, including the robust variations, only focus on the smallest
residual or the corresponding class label. This strategy loses much useful information.
Differing from the conventional manner, we interpret every patch representation
as a probability vector $b_t$. The $k$th element of $b_t$, namely $b_{t,k}$, is the probability that
the current test patch $y_t$ belongs to individual $k$, i.e.

$$b_{t,k} = P(\gamma_y = k \mid y_t). \tag{11}$$

We obtain the above posteriors by applying Bayes' theorem. First of all,
it is common to assume that all the classes share the same prior probability, i.e.
$P(\gamma_y = k) = 1/K,\ \forall k$. The linear-subspace assumption states that, if one test face belongs to
class $k$, the test patch $y_t$ should be distributed around the linear subspace spanned by $X_k^t$.
A $y_t$ far from the subspace is less probable than one close to it. In this
sense, when the category is known, we can assume that the random variable $y_t$ follows
a distribution with the probability density function

$$P(y_t \mid \gamma_y = k) = C \cdot \exp(-r_{t,k}^2 / \delta), \tag{12}$$

where $\delta$ is an assumed variance and $C$ is the normalization factor. This distribution
is, in essence, a singular normal distribution, as its covariance matrix is singular.
Figure 2 depicts the tailored distribution in a 2-D space for the linear-subspace assumption.




Figure 2: The demonstration of the singular normal distribution tailored to the linear-subspace assumption. 
The black line indicates the linear-subspace k while different colors represent different probabilities. The 
surface of the probability density function is also shown above. Note that here D = 2 thus the subspace can 
have the dimensionality at most 1, or in other words, a line. 

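According to Bayes' rule, the posterior probability is then derived as

$$b_{t,k} = \frac{P(y_t \mid \gamma_y = k)\, P(\gamma_y = k)}{\sum_{j=1}^{K} P(y_t \mid \gamma_y = j)\, P(\gamma_y = j)}
= \frac{C/K \cdot \exp(-r_{t,k}^2/\delta)}{\sum_{j=1}^{K} C/K \cdot \exp(-r_{t,j}^2/\delta)}
= \frac{\exp(-r_{t,k}^2/\delta)}{\sum_{j=1}^{K} \exp(-r_{t,j}^2/\delta)}. \tag{13}$$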

As an example, Figure 3 shows the distribution of the posterior $b_{t,1}$ when there are only
2 orthogonal linear subspaces (1 and 2) and the dimensionality is $D = 2$.

We finally aggregate all the posteriors into a vector $b_t = [b_{t,1}, b_{t,2}, \dots, b_{t,K}]^\top$.
The interpretation $b_t$, termed Bayesian Patch Representation (BPR), keeps
most information related to the representation and thus could lead to a more
accurate recognition result. In practice, it makes little sense to impose a constant $\delta$ for all
the patches and faces. We thus use a normalized one,

$$\delta_t = 0.1 \cdot \min_k (r_{t,k}^2), \tag{14}$$

for the $t$th patch.

Figure 3: The demonstration of the posterior distribution $b_{t,1} = P(\gamma_y = 1 \mid y_t)$. The dimensionality of
the original space is 2. In this example, the two linear subspaces, i.e. the two black lines, are orthogonal to each
other.
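
To make Equations (13) and (14) concrete, the sketch below turns a list of class-wise residuals for one patch into a BPR vector. The function name `bpr_from_residuals`, the handling of an exactly zero residual and the toy residual values are assumptions added for this example; they are not specified in the text.

```python
import math

def bpr_from_residuals(residuals, scale=0.1):
    """Posterior vector b_t of Equation (13), using delta_t from Equation (14)."""
    sq = [r * r for r in residuals]
    delta_t = scale * min(sq)
    if delta_t == 0.0:
        # Assumed edge-case rule: a zero residual makes delta_t zero,
        # so all probability mass goes to the exactly matching classes.
        hits = [1.0 if s == 0.0 else 0.0 for s in sq]
        return [h / sum(hits) for h in hits]
    # Shift by the smallest squared residual before exponentiating;
    # the common factor cancels in the ratio and avoids underflow of every term.
    m = min(sq)
    expo = [math.exp(-(s - m) / delta_t) for s in sq]
    z = sum(expo)
    return [e / z for e in expo]

# Toy residuals of one patch against K = 3 classes (hypothetical values).
b_t = bpr_from_residuals([0.30, 0.20, 0.45])
```

Because $\delta_t$ is tied to the smallest squared residual, the posterior is sharply peaked: in this toy case almost all mass falls on the second class.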

4 Combine the Bayesian Patch Representations 

4.1 Learn a BPR ensemble via Empirical Risk Minimization 

Besides the interpretation, the aggregation method is also vital for the final classifica- 
tion. The existing fusion rules, as shown in (8), are rudimentary and non-parameterized 
thus hard to optimize. In the machine learning community, classifier-ensembles learned 
via an Empirical Risk Minimization process are considered to be more powerful than 
the simple methods [17, 18]. 

As a consequence, we linearly combine the BPRs to generate a predicting vector
$\xi(y) = [\xi_1(y), \xi_2(y), \dots, \xi_K(y)]^\top \in \mathbb{R}^K$, i.e.

$$\xi(y) = \sum_{t=1}^{T} \alpha_t b_t(y) = B(y)\,\alpha, \tag{15}$$

with $\xi_k(y)$ indicating the confidence that $y$ belongs to the $k$th class, and
$\alpha = [\alpha_1, \alpha_2, \dots, \alpha_T]^\top \succeq 0$. Since a larger $\xi_k(y)$ means a higher
confidence, the identity of test face $y$ is then given by

$$\gamma_y = \arg\max_{k \in \{1, \dots, K\}} \xi_k(y). \tag{16}$$

This kind of linear model dominates the supervised learning literature as it is
flexible and feasible to learn. The parameter vector $\alpha$ is optimized via minimizing the
following Empirical Risk

$$ER = \sum_{i=1}^{N} \text{Loss}(z_i) + \lambda \cdot \text{Reg}(\alpha), \tag{17}$$
where $\text{Loss}(\cdot)$ is a certain loss function, $\text{Reg}(\cdot)$ is the regularization term and $\lambda$ is the
trade-off parameter. The margin $z_i = Z(l_i, \xi(x_i))$ reflects the confidence that $\xi$ selects
the correct label for $x_i$. Specifically, for binary classifications,

$$z_i = \xi_{l_i}(x_i) - \xi_{l'}(x_i), \quad l' \ne l_i. \tag{18}$$

For multiple-class problems, however, there is no perfect formulation of $z_i$. We then
intuitively define $z_i$ as

$$z_i = \frac{1}{K-1} \sum_{l' \ne l_i} \left( \xi_{l_i}(x_i) - \xi_{l'}(x_i) \right), \tag{19}$$

i.e. the mean of all the "bi-class margins". Recall that $\sum_{j=1}^{K} b_{t,j}(x_i) = 1$; we then
arrive at a simpler definition of $z_i$,

$$z_i = \frac{K}{K-1} \sum_{t=1}^{T} \alpha_t \left[ b_{t,l_i}(x_i) - \frac{1}{K} \right]. \tag{20}$$

By absorbing the constant $K/(K-1)$ into each $\alpha_t$, we have

$$z_i = \sum_{t=1}^{T} \alpha_t \left[ b_{t,l_i}(x_i) - \frac{1}{K} \right]. \tag{21}$$

The term $b_{t,l_i}(x_i) - 1/K$ can be thought of as the confidence gap between using the
$t$th BPR and using a random guess. The larger the gap, the more powerful this BPR
is. Consequently, $z_i$ is the weighted sum of all the gaps, which measures the predicting
capability of $\xi(x_i)$.
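
The step from the mean bi-class margin (19) to the gap form (20) can be checked numerically. The following sketch evaluates both expressions on a toy set of BPRs; the numbers, the function names and the choice of $T = 3$, $K = 4$ are assumptions made purely for illustration.

```python
def margin_mean_pairwise(xi, label):
    """Equation (19): mean of the bi-class margins xi[label] - xi[other]."""
    K = len(xi)
    return sum(xi[label] - xi[k] for k in range(K) if k != label) / (K - 1)

def margin_gap_form(bprs, alpha, label):
    """Equation (20): K/(K-1) * sum_t alpha_t * (b_{t,label} - 1/K)."""
    K = len(bprs[0])
    return K / (K - 1) * sum(a * (b[label] - 1.0 / K) for a, b in zip(alpha, bprs))

# Toy example: T = 3 BPRs over K = 4 classes (each row sums to 1), true label 2.
bprs = [[0.10, 0.20, 0.60, 0.10],
        [0.25, 0.25, 0.25, 0.25],   # an uninformative BPR: its gap is zero
        [0.05, 0.05, 0.80, 0.10]]
alpha = [0.5, 1.0, 2.0]

# Predicting vector, Equation (15).
xi = [sum(a * b[k] for a, b in zip(alpha, bprs)) for k in range(4)]
z19 = margin_mean_pairwise(xi, 2)
z20 = margin_gap_form(bprs, alpha, 2)
```

Both routes give the same value here, and the uninformative second BPR contributes nothing to the margin regardless of its weight.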

The selection of the loss function and the regularization function has been
extensively studied in the machine learning literature [19]. Among all the convex loss
formulations, we choose the exponential loss $\text{Loss}(z_i) = \exp(-z_i)$, motivated by its
success in combining weak classifiers [18, 20]. The $\ell_1$ norm is adopted as our
regularization method since it encourages the sparsity of $\alpha$, which is desirable when we want
an efficient ensemble. Finally, the optimization problem in this paper is given by:

$$\min_{\alpha} \sum_{i=1}^{N} \exp\left( -\sum_{t=1}^{T} \alpha_t \left[ b_{t,l_i}(x_i) - \frac{1}{K} \right] \right)
\quad \text{s.t.} \quad \alpha \succeq 0,\ \|\alpha\|_1 \le \lambda. \tag{22}$$

Note that for easing the optimization, we convert the regularization term to a constraint.
With an appropriate $\lambda$, this conversion won't change the optimization result [21]. The
optimization problem is convex and can be solved by using one of the off-the-shelf
optimization tools such as Mosek [22] or CVX [23]. The learned model is termed
Optimal Representation Ensemble (ORE) as it guarantees the global optimality of $\alpha$ from
the perspective of Empirical Risk Minimization. The learning algorithm for achieving
the ORE is referred to as ORE-Learning.
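
The text solves problem (22) with a generic convex solver. As a dependency-free stand-in, the sketch below uses projected gradient descent instead, which is a different technique from the one named above but targets the same convex problem. The function names, step size, iteration count and the toy gap matrix are all assumptions made for this example.

```python
import math

def project_capped_simplex(v, lam):
    """Euclidean projection onto {a : a >= 0, sum(a) <= lam}."""
    clipped = [max(x, 0.0) for x in v]
    if sum(clipped) <= lam:
        return clipped
    # Otherwise project onto the simplex {a >= 0, sum(a) = lam} (sort-based rule).
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - lam) / j
        if uj - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def ore_learn(gaps, lam, steps=500, lr=0.1):
    """Approximately minimise sum_i exp(-sum_t alpha_t * gaps[i][t]) under the
    constraints of (22), where gaps[i][t] = b_{t,l_i}(x_i) - 1/K."""
    N, T = len(gaps), len(gaps[0])
    alpha = [0.0] * T
    for _ in range(steps):
        # Per-sample exponential losses exp(-z_i).
        w = [math.exp(-sum(a * g for a, g in zip(alpha, gi))) for gi in gaps]
        # Gradient of the objective with respect to each alpha_t.
        grad = [-sum(w[i] * gaps[i][t] for i in range(N)) for t in range(T)]
        alpha = project_capped_simplex(
            [a - lr * g for a, g in zip(alpha, grad)], lam)
    return alpha

# Toy gaps for N = 4 samples and T = 3 BPRs (hypothetical values):
# BPR 0 is informative, BPR 1 is nearly random, BPR 2 is worse than random.
gaps = [[0.5, 0.10, -0.2],
        [0.4, -0.10, -0.1],
        [0.6, 0.00, -0.3],
        [0.3, 0.05, -0.2]]
alpha = ore_learn(gaps, lam=2.0)
```

On this toy input the weight budget ends up on the informative BPR and the other two weights are driven to zero, which is the sparsity effect that the $\ell_1$ constraint is meant to produce.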
4.2 Leave-one-out margin

It would be simple to calculate the margin $z_i$ if the BPRs were model-based, i.e. if $b_t(\cdot)$
were a set of explicit functions. In fact, that is the situation for most ensemble learning
approaches. Unfortunately, that is not the case in this paper, where $b_t(\cdot)$ is actually
instance-based.

For a BPR, we always need a gallery, a.k.a. the representation basis, to calculate
$b_t(\cdot)$. Ideally, the gallery should be the same for both training and test, otherwise
the learned model is only optimal for the training gallery. Nonetheless, we cannot
directly use the training set, which is the test gallery, as the training gallery. Any
training sample $x_i$ would be perfectly represented by the whole training set because
$x_i$ itself is in the basis. Consequently, all BPRs would generate identical outputs and the
learned weights $\alpha_t,\ \forall t$ would also be the same. Further dividing the training set into one
basis and one validation set is, of course, a feasible solution. However, it would reduce
the classification power of ORE, as a larger basis usually implies higher accuracy.

To get around this problem, we employ a leave-one-out strategy to utilize as many
training instances as possible for representations. For every training sample $x_i$, its
gallery is given by

$$X^{i^c} = X \setminus x_i,$$

i.e. the complement of $x_i$ w.r.t. the universe $X$. The leave-$x_i$-out BPRs, referred to as
$b_t^{i^c}(x_i),\ \forall t$, are yielded based on the gallery $X^{i^c}$. The leave-one-out margin $z_i$ is then
calculated as

$$z_i = \sum_{t=1}^{T} \alpha_t \left[ b_{t,l_i}^{i^c}(x_i) - \frac{1}{K} \right]. \tag{23}$$

In this way, the size of the training gallery is always $N - 1$, so we can approximately
consider the learned $\alpha^*$ as optimal for the test gallery $X$ with the size of $N$.
After $\alpha^*$ is obtained, we also calculate the leave-one-out predicting vector as

$$\xi^{i^c}(x_i) = B^{i^c}(x_i)\,\alpha^*, \tag{24}$$

where $B^{i^c}(x_i)$ is the collection of the leave-one-out BPRs. The training error of the
ORE-Learning is given by

$$e_{trn} = \frac{1}{N} \sum_{i=1}^{N} \left[\!\left[ \arg\max_k \xi_k^{i^c}(x_i) \ne l_i \right]\!\right], \tag{25}$$

where $[\![\cdot]\!]$ denotes the boolean operator. This training error, as illustrated below, plays a
crucial role in the model-selection procedure of ORE-Learning.
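
A minimal NumPy sketch of the leave-one-out BPRs and the training error of Equation (25) follows. It assumes a given weight vector `alpha` rather than a learned one, uses synthetic patches drawn near 2-dimensional subspaces, and guards $\delta_t$ against zero with a small constant; all function names and toy settings are assumptions for illustration.

```python
import numpy as np

def residual(G, y):
    """Least-squares reconstruction residual of y on the columns of G."""
    beta, *_ = np.linalg.lstsq(G, y, rcond=None)
    return np.linalg.norm(y - G @ beta)

def loo_bpr(galleries, k, i, scale=0.1):
    """Leave-one-out BPR of sample i of class k for one patch location.

    galleries[c]: array (d, M) of vectorised, unit-norm patches of class c.
    The sample itself is removed from its own class gallery before representing it.
    """
    y = galleries[k][:, i]
    sq = np.array([
        residual(np.delete(G, i, axis=1) if c == k else G, y) ** 2
        for c, G in enumerate(galleries)
    ])
    delta = max(scale * sq.min(), 1e-12)   # Equation (14), guarded against zero
    p = np.exp(-(sq - sq.min()) / delta)   # shifted exponent, same ratio as (13)
    return p / p.sum()

def loo_training_error(patch_galleries, alpha):
    """Equations (24)-(25): fraction of samples misclassified by their
    leave-one-out predicting vector. patch_galleries[t][c] has shape (d, M)."""
    K = len(patch_galleries[0])
    M = patch_galleries[0][0].shape[1]
    wrong = 0
    for k in range(K):
        for i in range(M):
            xi = sum(a * loo_bpr(g, k, i) for a, g in zip(alpha, patch_galleries))
            wrong += int(np.argmax(xi) != k)
    return wrong / (K * M)

# Synthetic data (hypothetical): T = 2 patch locations, K = 3 classes, M = 6 samples.
rng = np.random.default_rng(0)
d, M, K, T = 20, 6, 3, 2
patch_galleries = []
for _ in range(T):
    galleries = []
    for _ in range(K):
        basis = rng.standard_normal((d, 2))
        G = basis @ rng.standard_normal((2, M)) + 0.01 * rng.standard_normal((d, M))
        galleries.append(G / np.linalg.norm(G, axis=0))   # unit l2 norm per patch
    patch_galleries.append(galleries)

e_trn = loo_training_error(patch_galleries, alpha=[0.7, 0.3])
```

Without the `np.delete` step every training sample would be reconstructed perfectly by its own class, which is exactly the degenerate case the leave-one-out gallery avoids.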

4.3 Training-determined model-selection 

Another issue arising here is how to select a proper parameter $\lambda$ for the ORE-Learning.
Usually, a validation method such as $n$-fold cross-validation is performed to select
the optimal parameter among candidates. The validation method, however, is computationally
expensive, because one needs to repeat the extra "subset training" $n$
times, and usually $n > 5$. From the instance-based perspective, cross-validation is
also unacceptable. In every "fold" of an $n$-fold cross-validation, we only use a part of
the training samples as the gallery. This setting contradicts the principle that one needs to
keep the representation basis similar over all the stages.

Fortunately, the leave-one-out margin provides the ORE-Learning with an advantage:
the training error of the ORE-Learning serves as a good estimate of its
leave-one-out error. We can directly use the training error to select the model parameter $\lambda$. To
understand this, let us first recall the definition of the leave-one-out error.

Definition 4.1. (Leave-one-out error [1]) Suppose that $\mathcal{X}^N$ denotes a training set
space comprised of the training sets with $N$ samples $\{x_1, x_2, \dots, x_N\}$. Given an
algorithm $\mathcal{A}: \bigcup_{N=1}^{\infty} \mathcal{X}^N \to \mathcal{F}$, where $\mathcal{F}$ is the functional space of classifiers, the
leave-one-out error is defined by

$$e_{loo} = \frac{1}{N} \sum_{i=1}^{N} \left[\!\left[ F^{i^c}(x_i) \ne l_i \right]\!\right],$$

where $F^{i^c} = \mathcal{A}(\mathcal{X}^N \setminus x_i)$, i.e. the classifier learned using $\mathcal{A}$ based on the set $\mathcal{X}^N \setminus x_i$.

The leave-one-out error is known as an unbiased estimate for the generalization
error [7, 24]. Our target in this section is to build the connection between $e_{loo}$ and $e_{trn}$
for ORE-Learning. Suppose that all the training faces are non-disguised, which is the
common situation; then let us make the following basic assumption.

Assumption: One patch-location $t$ on the human face could be affected by $Q_t$
different expressions. Every expression leads to a distinct and convex Lambertian surface.

According to the theory in [5] and [6], the different appearances of one patch surface,
caused by illumination changes, span a linear subspace with a small dimensionality $\Phi$.
Given that $M$ training patches from the patch-location $t$ are collected in $X^t$, an arbitrary
subset $X_P$ of it contains $P$ ($P < \Phi \ll M$) samples. With the assumption, we can verify
the following lemma.

Lemma 4.1. (The stability of BPRs) If the training subset $X^t$ contains at least $(\Phi Q_t + P)$
i.i.d. patch samples, set $X^t$ and set $X^t \setminus X_P$ share the same representation basis.

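Proof: Let us denote the linear subspace formed by $X^t$ as $\mathcal{U}^t$, and let $\mathcal{U}_q^t$ refer to its subset
spanned by the patches associated with surface $q$. We know that

$$\text{Rank}(\mathcal{U}^t) \le \sum_{q=1}^{Q_t} \text{Rank}(\mathcal{U}_q^t) = \Phi Q_t.$$

When $X_P$ is moved out, the new linear subspace spanned by $X^t \setminus X_P$ is denoted by
$\mathcal{U}^{t\prime}$. According to the given condition, there are still at least $\Phi Q_t$ i.i.d. patches remaining. Then,
with an overwhelming probability,

$$\text{Rank}(\mathcal{U}^t) = \Phi Q_t = \text{Rank}(\mathcal{U}^{t\prime}). \tag{26}$$

Considering that $\mathcal{U}^{t\prime} \subseteq \mathcal{U}^t$, we then arrive at

$$\mathcal{U}^{t\prime} = \mathcal{U}^t.$$

The space remains the same, so does its basis.

□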
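
The stability claimed by Lemma 4.1 can be illustrated numerically for the simplest case of a single surface ($Q_t = 1$) and one removed sample ($P = 1$). The sketch below builds synthetic patches that lie exactly in a $\Phi$-dimensional subspace; the dimensions and the noise-free setting are assumptions chosen to keep the example small.

```python
import numpy as np

def residual(G, y):
    """Least-squares reconstruction residual of y on the columns of G."""
    beta, *_ = np.linalg.lstsq(G, y, rcond=None)
    return float(np.linalg.norm(y - G @ beta))

rng = np.random.default_rng(0)
d, phi, M = 30, 3, 10          # patch dimension, subspace dimension, gallery size (toy values)
basis = rng.standard_normal((d, phi))
X_t = basis @ rng.standard_normal((phi, M))   # M patches inside a phi-dimensional subspace
X_loo = np.delete(X_t, 0, axis=1)             # gallery after removing one sample (P = 1)

# The spanned subspace keeps its dimension after the removal.
rank_full = int(np.linalg.matrix_rank(X_t))
rank_loo = int(np.linalg.matrix_rank(X_loo))

# A new patch from the same subspace is represented equally well by both galleries.
y = basis @ rng.standard_normal(phi)
res_full = residual(X_t, y)
res_loo = residual(X_loo, y)
```

Since the residuals agree, a BPR computed from either gallery would be the same, which is the property the subsequent argument relies on.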
In contrast, other classifiers, such as decision trees or linear LDA classifiers, don't
have this desirable stability. They always depend on the exact data, rather than the
extracted space basis.

Note that $\Phi$ is usually very small [5, 6]. The value of $Q_t$ is determined by the types
of expressions that can affect patch $t$. It is also very limited if we only consider the
common ones. That is to say, with a reasonable number of training samples, the BPRs
are stable w.r.t. data fluctuation. Specifically, when $x_i$ is left out ($P = 1$), all the
BPRs' values on samples $\{x_j \in X \mid j \ne i\}$ won't change, i.e.

$$b_t^{j^c}(x_j) = b_t^{ij^c}(x_j), \quad \forall i \ne j,\ i, j \in \{1, 2, \dots, N\}, \tag{27}$$

where $ij^c$ stands for the complement of set $\{x_i, x_j\}$. From the perspective of
ensemble learning, the original ORE-Learning problem $\mathcal{A}(\mathcal{X}^N)$ and the leave-$x_i$-out problem
$\mathcal{A}(\mathcal{X}^N \setminus x_i)$ share the same "basic hypotheses"

$$b_t(x), \quad \forall t \in \{1, 2, \dots, T\},$$

and constraints

$$\alpha \succeq 0 \ \ \&\ \ \|\alpha\|_1 \le \lambda.$$

The only difference is that the former problem involves one more training sample, $x_i$.
We know that usually $N \gg 1$, thus one can approximately consider their solutions to be
the same, i.e.

$$\alpha^*_{i^c} = \alpha^*, \quad \forall i, \tag{28}$$

where $\alpha^*_{i^c}$ is the optimal solution for problem $\mathcal{A}(\mathcal{X}^N \setminus x_i)$. Finally, we arrive at the
following theorem.

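Theorem 4.1. With Equation (28) holding, the training error of the ORE-Learning
exactly equals its leave-one-out error.

Proof: In the context of ORE, all types of errors are determined by the predicting
vectors $\boldsymbol{\xi}(\mathbf{x}_i)$, $\forall i$. For the leave-one-out error, we know that

$$ \boldsymbol{\xi}^{loo}(\mathbf{x}_i) = B^{loo}(\mathbf{x}_i)\, \boldsymbol{\alpha}^*_{\bar{i}}, \quad \forall i, \tag{29} $$

where $B^{loo}(\mathbf{x}_i)$ is defined in (24). Recall that

$$ \boldsymbol{\xi}^{trn}(\mathbf{x}_i) = \boldsymbol{\xi}(\mathbf{x}_i) = B^{loo}(\mathbf{x}_i)\, \boldsymbol{\alpha}^*, \quad \forall i. $$

If Equation (28) is valid, then obviously $\boldsymbol{\xi}^{trn}(\mathbf{x}_i) = \boldsymbol{\xi}^{loo}(\mathbf{x}_i)$. Finally, we have

$$ e_{trn} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\Big[\arg\max_k \xi_k^{trn}(\mathbf{x}_i) \ne k_i\Big]
         = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\Big[\arg\max_k \xi_k^{loo}(\mathbf{x}_i) \ne k_i\Big]
         = e_{loo}. \tag{30} $$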

□

In practice, Lemma 4.1 and Equation (28) can only be considered approximately
true. However, we can still treat the training error as a good estimate of the
leave-one-out error. Recall that n-fold cross-validation is an approximation to the
leave-one-out validation. Thus the cross-validation error, a commonly used criterion
for model selection, is also an estimate of the leave-one-out error. We can then directly
employ the training error of ORE to choose the model parameter, without an extra
validation procedure. This fast model selection, termed "training-determined
model-selection", is justified empirically in the experiments. We tune $\lambda$ for both ORE and
Boosted-ORE, which is introduced below. No significant overfitting is observed.

Corollary 4.1. One can directly set the parameter $\lambda$ to a very small value, e.g.
$\lambda = 10^{-5}$, to achieve the ORE-model with a good generalization capability.

Proof: Because the training error of an ORE-Learning directly reflects its generalization
capability, the ORE-Learning is very resistant to overfitting. Considering that the main
reason for imposing the regularization is to curb overfitting, one can totally discard
the regularization term in optimization problem (22). However, to prevent the problem
from being ill-posed, we still need a constraint for $\boldsymbol{\alpha}$. Thus a $\lambda$ with a small value,
which implies a trivial regularizing effect, is appropriate. □

The above corollary is also verified in the experimental part. Admittedly, without
an effective $\ell_1$ regularization, one cannot expect the obtained ORE-model to be sparse
and efficient. Consequently, we still conduct the training-determined model-selection
to strike the balance between accuracy and efficiency.

5 ORE-Boosting for immense BPR sets 

In principle, the convex optimization for the ORE-Learning could be solved perfectly. 
Nonetheless, sometimes the patch number T is enormous or even nearly infinite. In 
those scenarios, to solve problem (22) via normal convex solvers is impossible. Recall 
that boosting-like algorithms can exploit the infinite functional space effectively [17, 
18]. We therefore can solve the immense problem in a boosting fashion, i.e. the BPRs 
are added into the ORE-model one by one, based upon certain criteria. 

5.1 Solve the immense optimization problem via the column-generation 

The conventional boosting algorithms [17,18] conduct the optimization in a
coordinate-descent manner. However, this is slow and cannot guarantee global optimality at
every step. Recently, several boosting algorithms based on column generation
[25, 20, 26] were proposed and showed higher training efficiencies. We thus follow
their principle to solve our problem.

To achieve the boosting-style ORE-Learning, the dual problem of (22) needs to be
derived first.

Theorem 5.1. The Lagrange dual problem of (22) writes

$$ \begin{aligned}
\min_{r, \mathbf{u}} \quad & r + \frac{1}{\lambda} \sum_{i=1}^{N} (u_i \log u_i - u_i) \\
\text{s.t.} \quad & \sum_{i=1}^{N} u_i \Big[ b_{t,\mathbf{u}}(\mathbf{x}_i) - \frac{1}{K} \Big] \le r, \quad \forall t, \\
& \mathbf{u} \succeq 0.
\end{aligned} \tag{31} $$

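Proof: Firstly, let us rewrite the primal problem (22) as

$$ \begin{aligned}
\min_{\boldsymbol{\alpha}, \boldsymbol{\varphi}} \quad & \sum_{i=1}^{N} \exp(\varphi_i) \\
\text{s.t.} \quad & \varphi_i = -\sum_{t=1}^{T} \Big( b_{t,\mathbf{u}}(\mathbf{x}_i) - \frac{1}{K} \Big) \alpha_t, \quad \forall i, \\
& \boldsymbol{\alpha} \succeq 0, \quad \|\boldsymbol{\alpha}\|_1 \le \lambda.
\end{aligned} \tag{32} $$

After assigning the Lagrange multipliers [21] $\mathbf{u} \in \mathbb{R}^N$, $\mathbf{q} \in \mathbb{R}^T$ and $r \in \mathbb{R}$
associated with the above constraints, we get the Lagrangian

$$ L(\boldsymbol{\alpha}, \boldsymbol{\varphi}, \mathbf{u}, \mathbf{q}, r)
   = \sum_{i=1}^{N} \exp(\varphi_i) - \sum_{i=1}^{N} u_i \big( \varphi_i + \boldsymbol{\theta}_i^{\top} \boldsymbol{\alpha} \big)
     - \mathbf{q}^{\top} \boldsymbol{\alpha} + r \big( \mathbf{1}^{\top} \boldsymbol{\alpha} - \lambda \big), \tag{33} $$

where $\theta_{t,i} = b_{t,\mathbf{u}}(\mathbf{x}_i) - 1/K$, $r \ge 0$ and $\mathbf{q} \succeq 0$. The Lagrange dual
function is defined as the infimum of the Lagrangian, i.e.

$$ \begin{aligned}
\inf_{\boldsymbol{\alpha}, \boldsymbol{\varphi}} L
 &= \inf_{\boldsymbol{\varphi}} \Big[ \sum_{i=1}^{N} \exp(\varphi_i) - \sum_{i=1}^{N} u_i \varphi_i \Big]
    + \inf_{\boldsymbol{\alpha}} \underbrace{\Big( r \mathbf{1}^{\top} - \mathbf{q}^{\top} - \sum_{i=1}^{N} u_i \boldsymbol{\theta}_i^{\top} \Big)}_{\text{must be } 0} \boldsymbol{\alpha} - r\lambda \\
 &= -\sum_{i=1}^{N} \underbrace{\sup_{\varphi_i} \big( u_i \varphi_i - \exp(\varphi_i) \big)}_{\text{the conjugate of } \exp(\varphi_i)} - r\lambda \\
 &= -\sum_{i=1}^{N} (u_i \log u_i - u_i) - r\lambda,
\end{aligned} \tag{34} $$

where $\boldsymbol{\theta}_i = [\theta_{1,i}, \theta_{2,i}, \dots, \theta_{T,i}]^{\top}$. After eliminating $\mathbf{q}$ we get the
first $T$ constraints in the dual problem. The conjugate of function $\exp(\varphi_i)$ requires that
$\mathbf{u} \succeq 0$, a.k.a. the second constraint of (31). The dual problem is to maximize the above
dual function.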
After simple algebraic manipulations, (31) is obtained. □

In Theorem 5.1, $\mathbf{u} = [u_1, u_2, \dots, u_N]$ is usually viewed as the weighted data
distribution. Considering that BPR is instance-based and thus depends on $\mathbf{u}$, we use
$b_{t,\mathbf{u}}(\cdot)$ to represent the $t$th BPR under the data distribution $\mathbf{u}$.

With the column-generation scheme employed in [25, 20, 26] and Theorem 5.1,
we design a boosting-style ORE-Learning algorithm. The algorithm, termed
ORE-Boosting, is summarized in Algorithm 1.
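
As a worked detail for the proof of Theorem 5.1, the conjugate step in (34) can be spelled out for a single sample. This is the standard convex-analysis computation for the exponential function, written with the symbols of (33); the convention $0 \log 0 = 0$ is an assumption of this sketch.

```latex
\begin{aligned}
\frac{\partial}{\partial \varphi_i}\bigl(u_i \varphi_i - e^{\varphi_i}\bigr)
  &= u_i - e^{\varphi_i} = 0
  \;\Longrightarrow\; \varphi_i^{\star} = \log u_i \quad (u_i > 0), \\
\sup_{\varphi_i}\bigl(u_i \varphi_i - e^{\varphi_i}\bigr)
  &= u_i \log u_i - u_i \quad (u_i \ge 0), \\
\sup_{\varphi_i}\bigl(u_i \varphi_i - e^{\varphi_i}\bigr)
  &= +\infty \quad (u_i < 0,\ \text{let } \varphi_i \to -\infty).
\end{aligned}
```

The infinite value for $u_i < 0$ is what turns into the constraint that the weights be non-negative in the dual problem.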

5.2 Ultrafast - the data-weight-free training

For the conventional basic hypotheses used in boosting, such as decision trees,
decision stumps and the linear-LDA-classifiers, one needs to re-train them after the training
samples' weights $\mathbf{u}$ are updated. Usually, the re-training procedure dominates the
computational complexity [25, 20].

At first glance, we need to follow this computationally expensive scheme since BPRs
are totally data-dependent. It is easy to see that the computational complexity of each BPR is

$$ C_L = O(M^3) + O(M^2 d), \tag{36} $$

so the complexity of the training procedure is given by

$$ C_{train} = T \cdot S \cdot C_L = O(TSM^3) + O(TSM^2 d). \tag{37} $$

The whole training procedure could be very slow when $T$ and $S$ are both large.

However, we argue that the ORE-Boosting can be performed much faster. To
explain this, let us first rewrite the constraint $\mathbf{u} \succeq 0$ in (31) as $\mathbf{u} \succ 0$. This change
won't influence the interior-point-based optimization method [21]. Then we can prove
the following theorem.

Theorem 5.2. Given that $\mathbf{u} \succ 0$, the BPRs are independent of the weight vector $\mathbf{u}$. In
other words, for ORE-Boosting, all the BPRs need to be trained only once.

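Proof: Let $U_k \in \mathbb{R}^{M \times M}$ be the diagonal matrix such that $U_k(i,i) = u_k(i)$,
$i = 1, 2, \dots, M$, where $\mathbf{u}_k$ is the weight vector for the training face images from the $k$th
class. By taking account of the data weights, the representation coefficients associated
with patch $t$ are given by

$$ \hat{\boldsymbol{\beta}}_{t,k} = \arg\min_{\boldsymbol{\beta}} \| \mathbf{y}_t - X_t^k U_k \boldsymbol{\beta} \|_2, \tag{38} $$

which has the closed-form solution

$$ \hat{\boldsymbol{\beta}}_{t,k} = \big( U_k X_t^{k\top} X_t^k U_k \big)^{-1} U_k X_t^{k\top} \mathbf{y}_t, \tag{39} $$

and we know that

$$ \mathbf{u} \succ 0 \;\Longrightarrow\; U_k \succ 0 \;\Longrightarrow\; U_k^{-1} \text{ exists}. \tag{40} $$

Thus (39) can be further rewritten into

$$ \begin{aligned}
\hat{\boldsymbol{\beta}}_{t,k} &= U_k^{-1} \big( X_t^{k\top} X_t^k \big)^{-1} U_k^{-1} U_k X_t^{k\top} \mathbf{y}_t \\
 &= U_k^{-1} \big( X_t^{k\top} X_t^k \big)^{-1} X_t^{k\top} \mathbf{y}_t = U_k^{-1} \boldsymbol{\beta}^*_{t,k},
\end{aligned} \tag{41} $$

where $\boldsymbol{\beta}^*_{t,k}$ is the solution to the unweighted BPR. We can now obtain the
reconstruction residual as

$$ \hat{r}_{t,k} = \| \mathbf{y}_t - X_t^k U_k \hat{\boldsymbol{\beta}}_{t,k} \|_2 = \| \mathbf{y}_t - X_t^k \boldsymbol{\beta}^*_{t,k} \|_2 = r_{t,k}. \tag{42} $$

This result, without loss of generality, is valid for all the classes and patches.
Considering that BPRs are determined by the associated residuals, we arrive at

$$ b_{t,k}^{\mathbf{u}} = b_{t,k}, \tag{43} $$

where the superscript $\mathbf{u}$ marks the BPR computed under the data weights $\mathbf{u}$.
That is to say, training data weights do not have any impact on the BPRs.

Actually, a more intuitive understanding of the above analysis is in (38): if we treat
$U_k \boldsymbol{\beta}$ as the variable of interest, we solve exactly the same problem as the standard
least squares fitting problem. □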
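
The invariance claimed by Theorem 5.2 can be checked numerically. The following is a minimal sketch, not the authors' code: the matrix sizes, the random data and the helper name `residual` are all hypothetical, and NumPy's generic least-squares solver stands in for the closed form in (39).

```python
import numpy as np

def residual(X, y, u=None):
    """Least-squares reconstruction residual of y on the columns of X.

    If u is given, column i of X is rescaled by the positive weight u[i],
    i.e. the basis X @ diag(u) is used, mirroring (38).
    """
    A = X if u is None else X * u          # broadcasting scales column i by u[i]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.linalg.norm(y - A @ beta)

rng = np.random.default_rng(0)
d, M = 25, 10                              # hypothetical patch dimension / samples per class
X = rng.standard_normal((d, M))            # stand-in for the class-k patch matrix
y = rng.standard_normal(d)                 # stand-in for a test patch
u = rng.uniform(0.1, 5.0, size=M)          # strictly positive data weights

r_plain = residual(X, y)
r_weighted = residual(X, y, u)
print(r_plain, r_weighted)                 # the two residuals agree up to rounding
```

Rescaling the columns by positive weights leaves their span unchanged, which is why the residual, and hence the BPR, does not move.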

According to the theorem, one needs to calculate the BPRs only once. In practice,
the following calculations are conducted for all the BPRs and training samples:

$$ c_t^i = b_{t,k_i}(\mathbf{x}_i) - \frac{1}{K}, \quad \forall t, i. \tag{44} $$

Note that $k_i$ is the ground-truth category of $\mathbf{x}_i$. The $T$ oracle vectors
$\mathbf{c}_t = [c_t^1, c_t^2, \dots, c_t^N]^{\top}$, $\forall t$, are stored beforehand. When performing the
ORE-Boosting, the optimization task in (35) is reduced to

$$ t^* = \arg\max_t \big( \mathbf{c}_t^{\top} \mathbf{u} \big). \tag{45} $$

With the oracle vectors $\mathbf{c}_t$, $\forall t$, the training cost is reduced by a factor of $S$ to

$$ C_{train} = T \cdot C_L = O(TM^3) + O(TM^2 d). \tag{46} $$

Usually, $S$ is of order $10^2$, so the above strategy can gain a speedup of a few hundred
times (see Section 7.6.2). This desirable property makes the proposed ORE-Boosting
very compelling in terms of computational efficiency.
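
The selection step in (45) amounts to one matrix-vector product per boosting iteration once the oracle vectors are stored. Below is a minimal sketch under that reading; the function name `select_bpr` and the toy numbers are hypothetical and do not come from the paper.

```python
import numpy as np

def select_bpr(C, u):
    """Return the index t* maximising c_t^T u, as in (45), plus all scores.

    C : (T, N) array whose t-th row is the precomputed oracle vector c_t.
    u : (N,) array holding the current sample weights.
    """
    scores = C @ u
    return int(np.argmax(scores)), scores

# Toy numbers (hypothetical): T = 3 candidate BPRs, N = 4 training samples.
C = np.array([[ 0.5, -0.2, 0.1, 0.3],
              [ 0.4,  0.4, 0.4, 0.4],
              [-0.1,  0.6, 0.2, 0.0]])
u = np.array([0.25, 0.25, 0.25, 0.25])
t_star, scores = select_bpr(C, u)
print(t_star, scores)
```

Because `C` never changes, only `u` has to be refreshed between iterations; no BPR is retrained.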



6 Face Recognition Using ORE — a Sophisticated Mix- 
ture of Inference and Learning 

As we discussed above, the LR-based algorithms are generative rather than
discriminative. Their main goal is to reconstruct the test face using training faces. From
another point of view, every LR-based algorithm is a pure inference procedure using
the generative model associated with a specific linear-subspace assumption. There is
no learning process performed because the generative model is predetermined by the
theoretical analysis [5, 6]. However, the theoretical proof is only valid under certain
ideal conditions, and the linear-subspace assumption itself is an approximation to the
derived illumination cone [5]. When this approximated model is applied with
"imperfect" gallery faces, accuracy reductions always occur.

The "imperfect" training faces, from another perspective, usually imply that more
information or patterns are involved. By effectively learning the meaningful ones, e.g.
expressions and disguises, we can enhance the prior-knowledge-determined model and
achieve higher performance. Of course, not all the patterns can be included in the
training set. When novel patterns arise during the test, one can only reduce their
influence via an inference process. In this sense, we argue that an ideal face recognizer
should contain two functional parts:

1. A learner, which can extract the existing patterns from the training set.

2. An inference approach, which can recognize the known patterns while discarding
the foreign ones in the test face.

The learning algorithms for ORE models have been proposed in Sections 4 and 5.
Now we design the ORE-tailored inference approach.

6.1 Robust-BPR - BPR with a Generic-Face-Confidence 

The BPR is informative enough to describe a patch-based LR, and one can learn
certain face patterns within the ensemble learning framework. However, in the test phase,
some unknown patterns, which usually present as non-face patches, might occur. Most
LR-methods, including the standard BPR, only pay attention to distinguishing the faces
of different individuals and thus can hardly handle this kind of pattern. On the other
hand, some evidence [27] suggests that generic faces, including all the categories,
also form a linear-subspace. This linear-subspace is sufficiently compact compared
with the general image space. Furthermore, some visual tracking algorithms have
already employed LR-approaches (SRC or its variations) to distinguish the foreground
from the background [28, 29].

Inspired by these successful implementations, we propose to employ the linear
representation for distinguishing face patches from face-unrelated or partly-face patches.
Specifically, a badly-contaminated face patch is supposed to be distant from the linear
subspace spanned by the training patches in the same position. In this manner, one can
measure the degree of contamination for each test patch.

Figure 4 illustrates the assumption about the linear-subspace of generic faces. Note
that the faces are merely for demonstration; in this paper, we actually focus on the face
patches. According to this assumption, a test patch will be considered as a face part
only when it is close enough to the corresponding "generic-face-patch" subspace.

Now we formalize this idea in the Bayesian framework. Given that all the training
face patches $X_t = [X_t^1, X_t^2, \dots, X_t^K] \in \mathbb{R}^{d \times N}$ are clean and form the
representation basis, for a test patch $\mathbf{y}_t$, the reconstruction residual is given by

$$ \tilde{r}_t = \| \mathbf{y}_t - X_t (X_t^{\top} X_t)^{-1} X_t^{\top} \mathbf{y}_t \|_2. \tag{47} $$

Let us use the notation $\nu_t = 1$ to indicate that $\mathbf{y}_t$ is a face patch, while $\nu_t = 0$
indicates the opposite. After taking the non-face category into consideration, the original
posterior in (11) is equivalent to $P(\gamma_y = k \mid \nu_t = 1, \mathbf{y}_t)$. The new target posterior
becomes

$$ \begin{aligned}
\tilde{b}_{t,k} &= P(\gamma_y = k, \nu_t = 1 \mid \mathbf{y}_t) \\
 &= P(\gamma_y = k \mid \nu_t = 1, \mathbf{y}_t) \cdot P(\nu_t = 1 \mid \mathbf{y}_t) \\
 &= b_{t,k} \cdot P(\nu_t = 1 \mid \mathbf{y}_t).
\end{aligned} \tag{48} $$

Figure 4: The demonstration of the generic-face subspace in the original 3-D feature space. Faces from
all the categories (K = 2 here) form a 2-D linear-subspace, i.e. a plane shown in light blue. Two
linear-subspaces, i.e. the lines shown in blue and red respectively, correspond to two different subjects. In
this work, however, we are only interested in the face patches and consequently the "generic-face-patch"
subspace is considered instead.

Following the principle of linear-subspace, we can assume that

$$ \begin{aligned}
P(\mathbf{y}_t \mid \nu_t = 0) &= C_0, \\
P(\mathbf{y}_t \mid \nu_t = 1) &= C_1 \cdot \exp(-\tilde{r}_t^2 / \tilde{\delta}),
\end{aligned} \tag{49} $$

where $C_1$ and $C_0$ are normalization constants. The subspace for the non-face category is
the universe space $\mathbb{R}^d$, which leads to the uniform distribution $P(\mathbf{y}_t \mid \nu_t = 0) = C_0$.
Recall that all the patches are normalized, thus the domain of $\mathbf{y}_t$ is bounded. One can
calculate both $C_1$ and $C_0$ with a specific $\tilde{\delta}$. For simplicity, let us define

$$ C = \frac{C_0 \cdot P(\nu_t = 0)}{C_1 \cdot P(\nu_t = 1)} = \frac{C_0}{C_1}, \tag{50} $$

because without any specific prior we usually consider $P(\nu_t = 0) = P(\nu_t = 1)$. We
then arrive at the new posterior, which is given by

$$ \begin{aligned}
\tilde{b}_{t,k} &= b_{t,k} \cdot P(\nu_t = 1 \mid \mathbf{y}_t) \\
 &= \frac{b_{t,k} \cdot P(\mathbf{y}_t \mid \nu_t = 1) \cdot P(\nu_t = 1)}{\sum_{j \in \{0,1\}} P(\mathbf{y}_t \mid \nu_t = j) \cdot P(\nu_t = j)} \\
 &= \frac{b_{t,k}}{1 + C \exp(\tilde{r}_t^2 / \tilde{\delta})}.
\end{aligned} \tag{51} $$

In practice, we replace $\tilde{b}_{t,k}$ with its upper bound

$$ \frac{1}{C} \cdot \exp(-\tilde{r}_t^2 / \tilde{\delta}) \cdot b_{t,k}. \tag{52} $$

Note that the constant $C$ won't influence the final classification result as all the BPRs
are linearly combined. As a result, we can discard the term $1/C$ and avoid the complex
integral operation for calculating it.

We call the term $\exp(-\tilde{r}_t^2 / \tilde{\delta})$ the Generic-Face-Confidence (GFC) as it peaks when
the patch is perfectly represented by generic face patches. With this confidence, we
can easily estimate how face-related an image patch is, or in other words, how heavily it is
contaminated by occlusions or noise. The BPR equipped with a GFC is less sensitive
to occlusions and noise, so we refer to $\tilde{\mathbf{b}}_t = [\tilde{b}_{t,1}, \tilde{b}_{t,2}, \dots, \tilde{b}_{t,K}]$ as the Robust-BPR.
The variance $\tilde{\delta}$ is usually data-dependent; we set

^ = 0.05. (^E^^) ' 

for all the faces. 
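
To make the Generic-Face-Confidence concrete, here is a minimal sketch that computes the residual of (47) and the resulting confidence for one clean and one non-face patch. Everything numerical is an assumption for illustration: the data are random, the sizes are arbitrary, `delta = 0.05` is an illustrative value rather than the data-dependent setting described above, and a generic least-squares solver replaces the explicit projection.

```python
import numpy as np

def generic_face_confidence(X_t, y_t, delta):
    """GFC = exp(-r^2 / delta), with r the residual of y_t on span(X_t)."""
    beta, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)
    r = np.linalg.norm(y_t - X_t @ beta)
    return float(np.exp(-r ** 2 / delta))

rng = np.random.default_rng(1)
d, N = 50, 20                                # hypothetical patch dimension / gallery size
X_t = rng.standard_normal((d, N))            # stand-in for all training patches at location t
X_t /= np.linalg.norm(X_t, axis=0)           # unit-norm columns, as in the experiments

clean = X_t @ rng.standard_normal(N)         # lies inside the "generic face" subspace
clean /= np.linalg.norm(clean)
noisy = rng.standard_normal(d)               # unrelated pattern, e.g. a noise block
noisy /= np.linalg.norm(noisy)

delta = 0.05                                 # illustrative value only
gfc_clean = generic_face_confidence(X_t, clean, delta)
gfc_noisy = generic_face_confidence(X_t, noisy, delta)
print(gfc_clean, gfc_noisy)
```

The clean patch is reconstructed almost exactly and receives a confidence near 1, while the unrelated patch is pushed towards 0.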



6.2 The GFC-equipped inference approach 

With the unknown patterns, the learned patch weights $\alpha_t$, $\forall t$ can no longer guarantee
their optimality. A highly-weighted patch location could be corrupted badly on
the test image. Consequently, it should merely play a trivial role in the test phase. In
other words, the importance of every patch should be reevaluated. We then employ
the proposed GFC to amend the importance of each patch. When the test face is
possibly contaminated, we aggregate the Robust-BPRs instead of the original BPRs.
In addition, the learned $\alpha_t$, $\forall t$ are not as reliable as before, thus we replace the original
$\alpha_t$ with its "faded" version, i.e.

$$ \tilde{\alpha}_t = \alpha_t^q, \quad q \in [0, 1], \quad \forall t, \tag{54} $$

where $q$ is the "fading coefficient". The smaller $q$ is, the less we take account of the
learned weights. Now we arrive at the new aggregation, which writes

$$ \boldsymbol{\xi} = \sum_{t=1}^{T'} \tilde{\alpha}_t \tilde{\mathbf{b}}_t = \sum_{t=1}^{T'} \alpha_t^q \, \mathrm{GFC}_t \, \mathbf{b}_t, \tag{55} $$

where $T'$ is the number of patches selected via the previous ORE-Learning and usually
$T' \ll T$, as we impose an $\ell_1$ regularization on the loss function. Figure 5 gives an
explicit illustration of the mechanism of the patch-weight amending procedure. In the upper
row, 31 patches are selected by using ORE-Learning. Their weights are also shown as
stems in the left chart. When a test face is badly contaminated by noisy occlusions, as
shown in the bottom row, those weights are not reliable anymore. After being modified by
the proposed method, all the large weights are assigned to the clean locations.
Consequently, the following classification can hardly be influenced by the occlusions. From a
bionic angle, the weight amendment is analogous to a focus-changing procedure, as the
previously emphasized parts look "unfamiliar" and are not reliable anymore.
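
The amended aggregation in (55) can be sketched in a few lines. This is an illustration only: the function name `robust_ore_predict`, the array layout and the toy numbers are hypothetical; the default `q = 0.2` is the value the experiments report for Robust-ORE.

```python
import numpy as np

def robust_ore_predict(B, alpha, gfc, q=0.2):
    """Aggregate BPRs as in (55): xi = sum_t alpha_t^q * GFC_t * b_t.

    B     : (T', K) array, row t is the BPR b_t over the K classes.
    alpha : (T',) learned non-negative patch weights.
    gfc   : (T',) Generic-Face-Confidence of each test patch.
    Returns the predicted class index and the aggregated vector xi.
    """
    w = (alpha ** q) * gfc        # faded weights, re-scaled by the confidences
    xi = w @ B
    return int(np.argmax(xi)), xi

# Toy case (hypothetical): 3 selected patches, 2 subjects.
B = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.3, 0.7]])
alpha = np.array([0.7, 0.2, 0.1])

label_trusted, _ = robust_ore_predict(B, alpha, gfc=np.ones(3))
label_occluded, _ = robust_ore_predict(B, alpha, gfc=np.array([0.01, 1.0, 1.0]))
print(label_trusted, label_occluded)
```

In the second call the most heavily weighted patch receives a very low confidence, so the decision is handed over to the remaining patches.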







Figure 5: The demonstration of the patch-weight amending method. Upper row: the patches selected by
ORE-Learning. Their weights are shown as stems in the left chart. Bottom row: one test face contaminated
by three noisy blocks. The patches' weights are modified by using GFC. We can observe that the ORE model
with the new weights pays more attention to the clean patches. In other words, the "attention" changes because
some previously trusted parts are not reliable anymore.

The patch-weight amendment and the subsequent aggregation process compose
the inference part of the ORE algorithm. To distinguish the inference-facilitated
ORE-model from the original one, we refer to it as Robust-ORE. Compared with the
anti-noise method proposed in [3] (see optimization problem (7)), our Robust-ORE does
not impose a sparsity assumption on the corrupted part, thus we can handle much larger
occlusions. Furthermore, our method is much faster than the robust SRC while
maintaining its high robustness, as shown in the experiment. Most recently, Zhou et al. [30]
proposed an advanced version of (7) by imposing a spatially-continuous prior on the
error vector e. The algorithm, admittedly, performed very well, especially on faces
with a single occlusion. However, we argue that the performance gain is due to the extra
spatial prior knowledge. In this paper, no spatial relation is considered.

6.3 The learning-inference-mixed strategy

Figure 6: The demonstration of ORE-Learning and the inference procedure. The extremely-simplified
problem only contains two subjects and two patch candidates. All the green items are related to Sub-1 while
the red ones are related to Sub-2. The solid arrows indicate linear representation approaches, with different
colors standing for different representation bases. The black solid arrows represent the representations based
on all the patches from a certain position, while the green and red ones stand for those corresponding to
Sub-1 and Sub-2 respectively.



Figure 6 summarizes the ORE algorithm with a simplified setting where only two
subjects (Sub-1 and Sub-2) and two patches (patch-1 on the right forehead and patch-2
on the middle face) are involved.

From the flow chart, we can see that the ORE algorithm is, in essence, a
sophisticated mixture of inference and learning. First of all, the patches are cropped and
collected according to their locations and identities (different columns in one
collection). Secondly, the leave-one-out margins are generated based on the leave-one-out
BPRs. Then the existing face patterns are learned via the ORE-Learning or
ORE-Boosting procedure. The learned results, $\alpha_1$ and $\alpha_2$, indicate the importance of the
two patches. When a probe image is given, one performs 3 different linear
representations for each test patch. The LRs with the patches from Sub-1 and Sub-2 generate the
BPRs $b_{t,1}$ and $b_{t,2}$ ($t \in \{1, 2\}$) respectively. In addition, we also use all the patches
from one location to represent the corresponding test patch. In this way, the
Generic-Face-Confidence ($\mathrm{GFC}_t$, $\forall t$) is calculated for each location. When calculating the
ORE output $\xi_k$, $k \in \{1, 2\}$, we multiply the term $\alpha_t^q b_{t,k}$ with the corresponding $\mathrm{GFC}_t$.
In this sense, one reduces the influence of unknown patterns (like the sunglasses in the
example) arising in the test image. This is, typically, an inference manner based on the
learned information ($\alpha_1$ and $\alpha_2$) and the prior assumption (the linear-subspaces
corresponding to different individuals and the generic face patches). Finally, the identity $\gamma$
is obtained via a simple comparison operation.

The excellence of this learning-inference-mixed strategy is demonstrated in the
following experiment part.

7 Experiments



7.1 Experiment setting 

We design a series of experiments for evaluating different aspects of the proposed
algorithm on two well-known datasets, namely Yale-B [31] and AR [16]. We compare
the recognition rates of the ORE algorithm with those of other LR-based state-of-the-art
methods, i.e. Nearest Feature Line (NFL) [1], Sparse Representation Classification
(SRC) [3], Linear Regression Classification (LRC) [4] and the two modular
heuristics: DEF and Block-SRC. As a benchmark, the Nearest Neighbor (NN) algorithm is
also performed. For the conventional LR-based methods, random projection
(Randomfaces) [3], PCA (Eigenfaces) [32] and LDA (Fisherfaces) [33] are used to reduce the
dimensionality to 25, 50, 100, 200 and 400. Note that the dimensionality of Fisherfaces
is constrained by the number of classes.

The ORE algorithm is performed using patches each comprised of 225 pixels.
The widths of those patches are randomly selected from the set {5, 9, 15, 25, 45} and
consequently we generate patches with 5 different shapes. Random projections are
employed to further reduce the dimensionality to 25, 50 and 100. We treat the ORE's
results with original patches (225-D) as its 200-D performance. The inverse value of
the trade-off parameter, i.e. $1/\lambda$, is selected from the candidates
{10, 20, 30, 40, 50, 60, 70, 80, 90, 100} via the training-determined model-selection
procedure. The variances $\delta$ and $\tilde{\delta}$ are set according to (14) and (53) respectively.
We let $q = 0.2$ for Robust-ORE. As to ORE-Boosting, we set the convergence precision
$\epsilon = 10^{-5}$ and the maximum iteration number $S = 100$.
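
The patch sampling described above can be sketched as follows. The paper fixes only the patch area (225 pixels) and the set of widths; the image size used in the usage lines, the uniform choice of position and the function name `random_patch` are assumptions made for illustration.

```python
import random

PATCH_AREA = 225
WIDTHS = (5, 9, 15, 25, 45)       # each width divides 225, giving 5 shapes

def random_patch(img_h, img_w, rng):
    """Draw one patch as (top, left, height, width) with height * width = 225.

    The position is sampled uniformly so that the patch stays inside the
    image; this placement rule is an assumption, not taken from the paper.
    """
    w = rng.choice(WIDTHS)
    h = PATCH_AREA // w
    top = rng.randrange(img_h - h + 1)
    left = rng.randrange(img_w - w + 1)
    return top, left, h, w

rng = random.Random(0)
# 192 x 168 is a hypothetical image size used only for this example.
patches = [random_patch(192, 168, rng) for _ in range(500)]
print(patches[:3])
```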

When carrying out the modular methods, we partition all the faces into 8 (4 x 2)
blocks and downsample each block to a smaller one of size 12 x 9, as
recommended by the authors [4]. For a fair comparison, we also reduce the dimensionality of
the face patches to 100 using random mapping when performing the ORE algorithm.

We conduct the test in different experimental settings to verify the recognition
capacities of our method in terms of both inference and learning. With each experimental
setting, the test is repeated 5 times and we report the average results and the
corresponding standard deviations. Every training and test sample, e.g. a face, patch or
block, is normalized so that

$$ \|\mathbf{x}_i\|_2 = \|\mathbf{y}\|_2 = 1, \quad \forall i. $$

All the algorithms are run in Matlab-R2009a, on a PC with a 2.6GHz quad-core
CPU and 8GB RAM. When testing the running speed, we only enable one CPU
core. All the optimization problems, including the ones for ORE-Learning, ORE-Boosting and
SRC, are solved by using Mosek [22].

7.2 Face recognition with illumination changes 

Yale-B contains 2,414 well-aligned face images, belonging to 38 individuals, captured
under various lighting conditions, as illustrated in Figure 7. For each subject, we
randomly choose 30 images to compose the training set and another 30 images for testing.
The Fisherfaces are only generated with dimensionality 25, as LDA requires that the
reduced dimensionality is smaller than the class number. When performing LRC and
ORE with 25-D data, we only randomly choose 20 training faces since the
least-squares-based approaches need an over-determined linear system. For this dataset, we employ
500 random patches as the task is relatively easy.




Figure 7: The demonstration of Yale-B dataset with extreme illumination conditions. 



The experiment results are reported in Table 1. As can be seen, the ORE-based
algorithms consistently outperform all the competitors. Moreover, all the proposed
methods achieve an accuracy of 99.9% on the 200-D (225-D in fact) feature space. To
our knowledge, this is the highest recognition rate ever reported for Yale-B under
similar circumstances. Given that 1,140 faces are involved as test samples and the recognition
rate is 99.9%, only about 1 face is incorrectly classified on average. In particular,
Robust-ORE, i.e. ORE equipped with Robust-BPRs, shows the highest recognition ability. Its
recognition rates are always above 99.8% when d > 50. The boosting-like variation of
the ORE algorithm performs similarly to its prototype and is also superior to the
other compared methods.

Figure 8 shows the boosting procedure, i.e. the training and test error curves, for
the ORE-Boosting algorithm with 100-D features. We observe fast decreases for both
curves, which justifies the efficacy of the proposed boosting approach. Furthermore, no
overfitting is observed even though the optimal model parameter $\lambda$ is selected
according to the training errors. This empirically supports our theoretical analysis in Section 4.3.

[Plot: "ORE-Boosting on Yale-B (100-D)", showing the test error and training error curves.]

Figure 8: Demonstration of the boosting procedure of ORE-Boosting with 100-D features on Yale-B.



On Yale-B, both ORE-Learning and its boosting-like cousin only select a very
limited part (usually around 5%) of all the candidate patches, thanks to the $\ell_1$-norm
regularization. To illustrate this, Figure 9 shows all the candidates (Figure 9(a)), the
patches selected by ORE-Learning (Figure 9(b)), and those selected by ORE-Boosting
(Figure 9(c)). We can see that the two algorithms make similar selections in terms of
patch positions and patch numbers (32 for ORE-Boosting vs. 31 for ORE-Learning).
Nonetheless, minor differences are shown w.r.t. the weight assignment, i.e. assigning
values to the coefficients $\alpha_t$, $\forall t$. The ORE-Boosting aggressively assigns dominant
weights to a few patches. In contrast, ORE-Learning distributes the weights more
uniformly. The more conservative strategy often leads to a higher robustness.




(a) candidates (b) ORE-Learning (c) ORE-Boosting 



Figure 9: The patch candidates (a) and those selected by ORE-Learning (b) and ORE-Boosting (c). All the
patches are shown as blocks. Their widths and colors indicate the associated weights $\alpha_t$. A thicker and
redder edge stands for a larger $\alpha_t$, i.e. a more important patch. The ORE algorithms are conducted on a
100-D feature space.



7.3 Face recognition with random occlusions 

The above task is completed nearly perfectly. However, sometimes the faces are
contaminated by occlusions and most state-of-the-art methods may fail on some of them. Most
occlusions occurring on face images can be divided into two categories: noisy occlusions
and disguises. Let us consider the noisy ones first. The noisy occlusions are the ones
not supposed to arise on a human face, or in other words, not face-related. They are
unpredictable, and thus hard to learn. We then design an experiment to verify the inference
capability of ORE-based methods. Considering that ORE-Learning and ORE-Boosting
select similar patches, we only perform the former one in this test. To generate the
corrupted samples for testing, we impose several Gaussian noise blocks on the Yale-B
faces. The blocks are square and of size $s \times s$, $s \in \{20, 40, 60, 80, 100, 120\}$.
The number of blocks is defined by

$$ N_o = \max\{\mathrm{round}(0.4\, a_f / s^2),\, 3\}, \tag{56} $$

where $a_f$ represents the area of the whole face image. That is to say, the occluded parts
won't cover more than 40% of the area of the original face, unless the number requirement
$N_o \ge 3$ is not met. The yielded faces are shown in Figure 10. We can see that when
$s = 120$, the contaminated parts dominate the face image.

(a) s = 20   (b) s = 40   (c) s = 80   (d) s = 120

Figure 10: The Yale-B faces with Gaussian noise occlusions. The block size is increased from 20 to 120.
We can see that when s = 120, more than 60% of the face image is totally contaminated.
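
The block-count rule in (56) is easy to tabulate. The sketch below is illustrative only: the image size 192 x 168 is an assumption (the section does not state the face area), the function name `num_noise_blocks` is hypothetical, and Python's built-in `round` is used, which rounds exact halves to the nearest even integer and may differ from the rounding in the original experiments in such tie cases.

```python
def num_noise_blocks(face_area, s):
    """Number of s-by-s noise blocks according to (56)."""
    return max(round(0.4 * face_area / s ** 2), 3)

# Hypothetical face size, used only to make the example concrete.
FACE_AREA = 192 * 168

for s in (20, 40, 60, 80, 100, 120):
    n = num_noise_blocks(FACE_AREA, s)
    print(f"s = {s:3d}: {n:2d} blocks")
```

Under this assumed size the 40% budget determines the count for the small blocks, while the floor of three blocks takes over once $s \ge 80$, which is why the largest blocks end up covering most of the face.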

Before testing, we train our ORE models on the clean faces (30 faces for each
individual). Then, on the contaminated faces (also 30 faces for each individual), we test
the learned models, with or without Robust-BPRs, and compare them to the modular heuristics.
In this way, we guarantee that no occlusion information is given in the training phase.
As a reference, we also perform the standard LRC to illustrate different difficulty levels.
The experiment is repeated 5 times with the training and test faces selected randomly^.
The results are shown in Table 2.

As we can see, again, the proposed ORE models achieve overwhelming
performances. In particular, the original ORE-models are nearly (except for the case where
s = 20) consistently better than all the state-of-the-art methods. Furthermore, the Robust-ORE
models show a very high robustness to the noisy occlusions. They are always ranked
first in all the conditions and achieve recognition rates above 98% when s < 120.
Recall that the performance obtained by ORE models on clean test sets is 99.9%. The
severe occlusions merely reduce the performance of the ORE model by around two
percent. When the face is dominated by continuous occlusions (s = 120), the accuracies
of the modular methods drop sharply to below 60% while that of Robust-ORE
is still above 90%. This success justifies our assumption about the generic-face-patch
linear-subspace.

7.4 Face recognition with expressions and disguises 

Another kind of common occlusion is the functional disguise, such as sunglasses and
scarves. Disguises are, generally speaking, face-related and intentionally put onto the faces.
This kind of occlusion is unavoidable in real life. Besides this difficulty, expression
is another important influential factor. Expressions invalidate the rigidity of the face
surface, which is one foundation of the linear-subspace assumption. To verify the
efficacy of our algorithms on disguises and expressions, we employ the AR dataset.
There are 100 individuals in the AR (cropped version) dataset. Each subject has
26 face images which come with different expressions and considerable disguises such
as scarves and sunglasses (see Figure 11).

^We guarantee that a clean face and its contaminated version won't be selected simultaneously in each
test.

Figure 11: Images with occlusions and expressions in the AR dataset. Note that we use only gray-scale faces in
the experiment.

First of all, following the conventional scheme, we use all the clean and
inexpressive faces (8 faces for each individual) as training samples and test the algorithms
on those with expressions (6 faces per individual), sunglasses (6 faces per
individual) and scarves (6 faces per individual) respectively. Similar to the strategy for
handling random occlusions, 500 random patches are generated and we also employ the
Robust-BPRs for testing. The test results can be found in Table 3. Note that the tests
for Block-SRC, DEF and LRC are conducted once as the data split is deterministic.
Consequently, no standard deviation is reported for those algorithms. The ORE-based
methods are still run 5 times, with different random patches and random
projections.

According to the table, Robust-ORE beats other methods in all the scenarios. In 
particular, for the faces with scarves, both of ORE and Robust-ORE are superior to 
other methods. The performance gap between Robust-ORE and the involved state-of- 
the-arts is around 10%. 

7.5 Learn the patterns of disguises and expressions 

The expressions and disguises share one desirable property: they can be characterized 
by typical and limited patterns. One thus can learn those patterns within our ensemble 
learning framework. To verify the learning power of the proposed method, we re- 
split the data: for each individual, 13 images are randomly selected for training while 
the remaining ones are test images. In this way, the ORE-Learning or ORE-Boosting 
algorithm is given the information on disguise patterns. The experiment on AR is rerun 
in the new setting. Table 4 shows the recognition accuracies. Note that the results for 
the 100-D Fisherface are actually obtained by using 95-D features since here (100 
categories) the dimensionality limit for LDA is 99. 

Similar to the previous test, our methods once again show clear superiority. 
The Robust-ORE algorithm achieves a recognition rate of 99.5%, which is also the 
best reported result on AR in a similar experimental setting. In this sense, we can 
conclude that the ORE algorithms can effectively learn the patterns of disguises. The 
boosting-like variation of ORE-Learning obtains remarkable performance as well, but 
is slightly worse than the original version. Besides the ORE algorithms, the Fisherface 
approach also shows a high learning capacity. With Fisherfaces, the simplest 
Nearest Neighbor algorithm already achieves a recognition rate of 97.9%. This empirical 
evidence implies that discriminative face recognition methods usually benefit 
from learning certain face-related patterns. 

Figure 12 shows the patch candidates (Figure 12(a)) and the selected ones for ORE-Learning 
(Figure 12(b)) and ORE-Boosting (Figure 12(c)). As illustrated in the figure, 
the 500 patch candidates redundantly sample the face image. Both ORE-Learning and 
ORE-Boosting choose 54 patches, and ORE-Learning still employs a more conservative 
weight-assignment strategy. Differing from Figure 9, the ORE algorithms now focus 
on the forehead more than on the eyes and the mouth. Considering that sunglasses and scarves 
usually cover those two places, the disguise patterns are learned and the 
corresponding patch positions are trusted less during the test. 




Figure 12: The patch candidates (a) and those selected by ORE-Learning (b) and ORE-Boosting (c). All 
the patches are shown as blocks. Their widths and colors indicate the associated weights $\alpha_i, \forall i$. A thicker 
and redder edge stands for a larger $\alpha_i$, i.e. a more important patch. The ORE algorithms are conducted on a 
100-D feature space. 



7.6 Efficiency 

For a practical computer vision algorithm, the running speed is usually crucial. Here 
we show the extremely high efficiency of the proposed algorithms, in terms of both 
training and testing. 

7.6.1 The verification for the fast model selection 

First of all, let us verify the training-determined model selection for ORE-Learning 
and ORE-Boosting. Figure 13 shows the training error and test error curves as 
the model complexity, parameterized by $1/\lambda$, increases. The two curves, as we can see, 
show nearly identical tendencies. In particular, when $1/\lambda$ = 1e5, i.e. only trivial 
regularization is imposed, we still cannot observe any deviation between the two errors. In 
other words, overfitting does not occur. Consequently, we can employ the training error 
of ORE-Learning as an accurate measurement of the generalization ability and choose 
a proper model parameter based on it. The time required for model selection is therefore 
reduced dramatically. 
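
The selection procedure described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' code: `fit` and `training_error` are hypothetical callables standing in for ORE-Learning and its training-error evaluation, and the toy error curve is artificial and exists only so that the snippet runs. The candidate grid is read off the x-axis of Figure 13.

```python
def select_by_training_error(candidates, fit, training_error):
    """Pick the candidate whose fitted model has the lowest training error.

    `fit` and `training_error` are hypothetical callables (assumptions),
    not functions defined in the paper.
    """
    best_value, best_error = None, float("inf")
    for value in candidates:
        model = fit(value)              # exactly one training run per candidate
        error = training_error(model)
        if error < best_error:
            best_value, best_error = value, error
    return best_value, best_error


def training_runs(n_candidates, n_folds=1):
    """Training runs needed for a grid search; n_folds=1 means no cross-validation."""
    return n_candidates * n_folds


# Candidate values of the complexity parameter, taken from the x-axis of Figure 13.
GRID = [1, 2, 5, 7, 10, 20, 30, 40, 50, 100, 200, 300, 500, 1000, 1e5]


def toy_fit(value):
    """Toy stand-in: the 'model' is just the parameter value itself."""
    return value


def toy_error(model):
    """Artificial error curve with its minimum placed at 20 (not real data)."""
    return abs(model - 20) / 1e5


best_value, best_error = select_by_training_error(GRID, toy_fit, toy_error)
runs_training_only = training_runs(len(GRID))        # one run per candidate
runs_five_fold_cv = training_runs(len(GRID), 5)      # five runs per candidate
```

With the 15 candidates above, selection on the training error needs 15 training runs, whereas a 5-fold cross-validation over the same grid would need 75.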




[Plot: test error and training error versus $1/\lambda$; x-axis ticks: 1, 2, 5, 7, 10, 20, 30, 40, 50, 100, 200, 300, 500, 1000, 1e5]

Figure 13: The training errors and test errors as the model complexity $1/\lambda$ increases. Note that the 
x-axis is not linearly scaled. The results are generated in a 100-D feature space with Yale-B faces. 



7.6.2 The improvement on the training speed 

Besides the fast model selection, we have also theoretically validated the fast training 
procedure. It achieves, as illustrated above, very promising accuracies. Here we evaluate 
the improvement in training speed. Figure 14 depicts the difference in the time 
consumed for training a Boosted-ORE model between the methods with and without 
updating the BPRs at every iteration. The test is conducted with an increasing number 
(from 10 to 2,000) of BPRs and the trade-off parameter $\lambda = 0.02$ in the 100-D feature 
space. As illustrated, the efficiency gap is huge. Without the BPR recalculation, one 
saves from 700 seconds (10 BPRs) to more than 10 days (2,000 BPRs) of training time. 



7.6.3 The highest execution efficiency 

Finally, let us verify the most important efficiency property: execution speed. The 
test face (or face patch) is randomly projected to a lower-dimensional space. For each 
reduced dimensionality, all the face recognition algorithms are performed 100 times 
on faces from Yale-B. We record the elapsed times (in ms) for each method and show 
the average values in Figure 15. Note that for LRC and the ORE-based methods, there is 
no need to solve LRs at test time, as all the representation bases are deterministic. 
Before the test, one can pre-calculate and store all the matrices 

$$E = (X^\top X)^{-1} X^\top, \qquad (57)$$

where $X$ represents the different bases used by the different algorithms. Then the representation 
coefficients $\beta$ for the test face (or patch) $y$ can be obtained via a simple matrix 
multiplication, i.e. 

$$\beta = E y. \qquad (58)$$
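
Equations (57) and (58) amount to the following NumPy sketch. The matrix sizes and the random data are assumptions chosen only for illustration; the basis is assumed to have full column rank, so that the inverse in (57) exists. The function names are hypothetical.

```python
import numpy as np


def precompute_projector(X):
    """Return E = (X^T X)^{-1} X^T for a basis X of shape (d, n) with full column rank."""
    # Solving (X^T X) E = X^T avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T)


def represent(E, y):
    """Representation coefficients beta = E y, a single matrix-vector product."""
    return E @ y


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))   # assumed: 100-D features, 8 basis faces
y = rng.standard_normal(100)        # assumed: one test face (or patch)

E = precompute_projector(X)         # done once, before testing
beta = represent(E, y)              # done per test face

# Reference: solve the same least-squares problem from scratch.
beta_ref = np.linalg.lstsq(X, y, rcond=None)[0]
```

Because `E` depends only on the training basis, the per-face cost at test time reduces to one product of an n-by-d matrix with a d-vector.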

Figure 14: The training times consumed by the ORE-Boosting methods with the BPR recalculations and 
without them. Note that the y-axis is shown in the logarithmic scale. The results are obtained on the AR 
dataset with 100-D features. 

Figure 15: Comparison of the running time. Note that the y-axis is in the logarithmic scale. We don't 
perform ORE algorithms in the 400-D feature space as each patch only has 225 pixels. The 200-D results 
for the proposed methods are actually obtained in the original 225-D space. 

As demonstrated, the two SRC-based algorithms are the slowest. The original SRC 
needs up to 31 seconds (400-D) to process one test face. The Block-SRC approach, 
which shows relatively high robustness in the literature, is even less efficient: 
for 400-D features, one needs to wait more than 4 minutes for one prediction. 
NFL is also slow, requiring 9 to 1,113 ms to handle one test 
image. In contrast, the ORE-based methods consistently outperform the others in terms of 
efficiency. In particular, in the 200-D (in fact 225-D) feature space, one needs only 
16 ms to identify a probe face with either ORE algorithm. This speed not only 
far exceeds those of SRC and NFL, but is also twice as high as those of LRC and 
NN. 

Such a high efficiency, however, seems counter-intuitive. Intuitively, the time consumed 
by LRC should always be shorter than that for ORE, because ORE performs multiple 
LRs (here actually matrix multiplications) while LRC performs only one. We therefore 
tracked the execution time of the Matlab code via the "profile" facility. We found that, 
with high-dimensional features and efficient classifiers, it is the dimension reduction 
that dominates the time usage. The NN and LRC algorithms both perform the linear 
projection over all the pixels, while ORE selects only a small part of them (the pixels in the 
patches) for the dimension reduction. As a result, if not too many patches 
are selected, the ORE algorithms usually run even faster than LRC 
and NN. 
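
The argument can be made concrete with a rough operation count. The sketch below is illustrative only: the image size of 192 x 168 pixels is an assumption (it is not stated in this section), the patch count of 54 is borrowed from the AR experiment purely as an example, and only the multiply-adds of the dense linear projection are counted. The 225 pixels per patch come from the text.

```python
def projection_cost(n_pixels, n_dims):
    """Multiply-adds of one dense linear projection from n_pixels to n_dims."""
    return n_pixels * n_dims


def patchwise_projection_cost(n_patches, patch_pixels, n_dims):
    """Multiply-adds when each selected patch is projected separately."""
    return n_patches * projection_cost(patch_pixels, n_dims)


def break_even_patches(image_pixels, patch_pixels):
    """Largest patch count for which the patch-wise projection is strictly cheaper."""
    return (image_pixels - 1) // patch_pixels


IMAGE_PIXELS = 192 * 168   # assumption: holistic face size
PATCH_PIXELS = 225         # from the text: each patch has 225 pixels
N_DIMS = 100               # reduced dimensionality
N_PATCHES = 54             # illustrative patch count

holistic = projection_cost(IMAGE_PIXELS, N_DIMS)
patchwise = patchwise_projection_cost(N_PATCHES, PATCH_PIXELS, N_DIMS)
limit = break_even_patches(IMAGE_PIXELS, PATCH_PIXELS)
```

Under these assumptions the holistic projection costs 3,225,600 multiply-adds against 1,215,000 for the 54 patches, and the patch-wise projection stays cheaper for up to 143 patches.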

Recall that the proposed methods achieve almost all the best recognition rates under 
different conditions. We conclude that the ORE model is a very promising 
face recognizer, which is not only the most accurate but also the most efficient among the compared methods. 

8 Conclusion and future topics 

In this paper, a learning-inference-mixed framework is proposed for face recognition. 
Observing that, in practice, only part of the face is reliable for the linear-subspace 
assumption, we generate random face patches and conduct LRs on each of them. The 
patch-based linear representations are interpreted by using the Bayesian theory and 
linearly aggregated by minimizing the empirical risk. The resulting combination, Optimal 
Representation Ensemble, shows a high capability of learning face-related patterns and 
outperforms the state-of-the-art methods in both accuracy and efficiency. With ORE-models, one 
can almost perfectly recognize the faces in the Yale-B (with an accuracy of 99.9%) and AR 
(with an accuracy of 99.5%) datasets at a remarkable speed (below 20 ms per face 
using unoptimized Matlab code and one CPU core). 

For handling foreign patterns arising in test faces, the Generic-Face-Confidence (GFC) is 
derived by taking the non-face patch into consideration. Facilitated by GFCs, the 
ORE-model shows a high robustness to noisy occlusions, expressions and disguises. It beats the 
modular heuristics under nearly all the circumstances. In particular, for Gaussian noise 
blocks, the recognition rate of our method is always above 93% and fluctuates around 
99% when the blocks are not too large. For real-life disguises and facial expressions, 
Robust-ORE also outperforms the competitors consistently. 

In addition, to accommodate the instance-based BPRs, a novel ensemble learning 
algorithm is designed based on the proposed leave-one-out margins. The learning 
algorithm, ORE-Learning, is theoretically and empirically shown to be resistant to 
overfitting. This desirable property leads to a training-determined model selection, which 
is much faster than conventional n-fold cross-validation. For immense BPR sets, we 
propose the ORE-Boosting algorithm to exploit the vast functional spaces. Furthermore, 
we also increase the training speed substantially by proving that ORE-Boosting is 
actually data-weight-free. 

As for future work, one promising direction is to exploit the spatial information 
for ORE-models. Similar to [30], one could employ a Markov Random Field 
(MRF) method to analyze the patch-based GFCs. Even higher accuracies could be 
achieved, considering that the GFC is more informative and robust than a single pixel. 
Secondly, ORE-models can be extended to video-based face recognition by using 
online-learning algorithms. Considering that the Robust-BPRs can distinguish face 
parts from non-facial ones, we also want to design an ORE-based face detection 
algorithm. By merging the detector with this work, we could finally obtain a 
multiple-task ORE-model that performs detection and recognition simultaneously. 

Algorithm 1: ORE-Boosting 

Input: 

• A set of training data $X = [x_1, x_2, \cdots, x_N]$. 

• A set of patch-locations, indexed by $1, 2, \cdots, T$. 

• A termination threshold $\epsilon > 0$. 

• A maximum training step $S$. 

• A primitive dual problem: 

$$\min_{u, r} \; r + \frac{1}{\lambda} \sum_i (u_i \log u_i - u_i), \quad \text{s.t.} \; u \succeq 0.$$

begin 

• Initialize $\alpha = 0$, $t = 0$, and the weights $u_i$, $\forall i$; 

for $s \leftarrow 1$ to $S$ do 

    • Find a new BPR, $b_{t^*}$, such that 

    $$t^* = \arg\max_{t \in \{1, 2, \dots, T\}} \sum_{i=1}^{N} u_i \left( b_{t, l_i}(x_i) - 1/K \right); \qquad (35)$$

    • if $\sum_{i=1}^{N} u_i \left( b_{t^*, l_i}(x_i) - 1/K \right) < r + \epsilon$, break; 

    • Assign the inequality 

    $$\sum_{i=1}^{N} u_i \left( b_{t^*, l_i}(x_i) - 1/K \right) \le r$$

    into the dual problem as its $s$th constraint; 

    • Solve the updated problem; 

• Calculate the primal variable $\alpha$ according to the dual solutions and KKT 
conditions; 

end 

Output: The Boosted-ORE: $F(y) = \arg\max_k \sum_t \alpha_t \cdot b_{t,k}(y)$. 
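
The selection step (35), the stopping test and the final decision rule of Algorithm 1 can be sketched in NumPy as below. This is a hedged reading rather than the authors' implementation: it assumes that each BPR's contribution for a training sample is its score for the true class minus $1/K$, and that these values are already stored in a hypothetical array `margins`. The dual solver itself is not shown, and all numbers are made up for illustration.

```python
import numpy as np


def select_bpr(margins, u):
    """Return (t_star, score) maximizing sum_i u_i * margins[t, i], as in (35).

    margins: (T, N) array; margins[t, i] is assumed to hold the true-class
             score of BPR t on training sample i, minus 1/K.
    u:       (N,) array of dual sample weights.
    """
    scores = margins @ u
    t_star = int(np.argmax(scores))
    return t_star, float(scores[t_star])


def should_stop(best_score, r, eps):
    """Stopping test: no BPR violates the current dual constraints by eps or more."""
    return best_score < r + eps


def boosted_ore_predict(alpha, bpr_outputs):
    """Decision rule: argmax_k of sum_t alpha_t * bpr_outputs[t, k].

    alpha:       (T,) array of BPR weights.
    bpr_outputs: (T, K) array of class scores of each BPR for one test face.
    """
    return int(np.argmax(alpha @ bpr_outputs))


# Toy numbers (assumptions): T = 3 BPRs, N = 3 training samples, K = 2 classes.
margins = np.array([[0.2, -0.1, 0.3],
                    [0.5, 0.4, -0.2],
                    [0.0, 0.1, 0.1]])
u = np.full(3, 1.0 / 3.0)

t_star, best_score = select_bpr(margins, u)
stop_now = should_stop(best_score, r=0.3, eps=1e-3)
keep_going = not should_stop(best_score, r=0.1, eps=1e-3)

alpha = np.array([0.0, 0.7, 0.3])
bpr_outputs = np.array([[0.6, 0.4],
                        [0.2, 0.8],
                        [0.5, 0.5]])
predicted_class = boosted_ore_predict(alpha, bpr_outputs)
```

Storing the margins once and reusing them in every iteration mirrors the case without BPR recalculation discussed in Section 7.6.2.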

                     25-D         50-D         100-D        200-D        400-D
LDA   NN             93.4 ± 1.3   -            -            -            -
LDA   NFL            89.4 ± 1.0   -            -            -            -
LDA   SRC            92.5 ± 1.2   -            -            -            -
LDA   LRC            58.0 ± 1.9   -            -            -            -
Rand  NN             42.6 ± 4.0   51.4 ± 1.5   54.2 ± 3.0   54.8 ± 1.7   56.6 ± 1.5
Rand  NFL            83.2 ± 1.7   88.2 ± 1.0   89.5 ± 0.6   90.7 ± 0.5   90.9 ± 0.4
Rand  SRC            80.1 ± 1.6   90.7 ± 1.0   94.7 ± 0.5   96.6 ± 0.7   97.1 ± 0.5
Rand  LRC            25.9 ± 4.1   88.1 ± 0.6   93.1 ± 1.2   94.5 ± 0.4   94.7 ± 0.4
PCA   NN             22.3 ± 1.8   30.4 ± 1.7   34.4 ± 0.5   36.6 ± 1.2   37.0 ± 1.0
PCA   NFL            69.5 ± 1.4   77.4 ± 1.2   81.4 ± 1.0   83.0 ± 0.5   83.5 ± 0.5
PCA   SRC            80.4 ± 1.6   89.1 ± 0.9   92.8 ± 0.8   94.2 ± 0.7   95.1 ± 0.7
PCA   LRC            74.7 ± 1.9   88.1 ± 0.4   89.8 ± 0.3   90.7 ± 0.5   90.8 ± 0.6
      ORE            96.5 ± 0.5   99.6 ± 0.2   99.7 ± 0.1   99.9 ± 0.1   -
      Robust-ORE     98.3 ± 0.3   99.8 ± 0.2   99.9 ± 0.1   99.9 ± 0.1   -
      Boosted-ORE    95.6 ± 1.2   99.6 ± 0.2   99.8 ± 0.1   99.9 ± 0.1   -

Table 1: The comparison of accuracy on Yale-B. The highest recognition rates are shown in bold. Note that 
we only perform algorithms with the Fisherface (LDA) on the 25-D feature space. The original patch has 
225 pixels, thus we can't conduct ORE algorithms with 400-D features. 





               s = 20       s = 40       s = 60       s = 80       s = 100      s = 120
LRC (400-D)    74.1 ± 1.4   69.7 ± 1.3   68.4 ± 1.5   45.5 ± 1.4   30.4 ± 0.7   16.7 ± 0.2
DEF            42.9 ± 0.3   80.1 ± 0.4   88.8 ± 1.0   72.3 ± 0.6   48.0 ± 1.4   26.6 ± 1.3
Block-SRC      94.1 ± 0.5   93.3 ± 0.5   94.1 ± 0.5   85.7 ± 0.8   78.3 ± 0.4   56.8 ± 0.6
ORE            93.9 ± 2.6   98.2 ± 1.0   98.8 ± 0.6   97.5 ± 1.7   94.2 ± 3.6   86.1 ± 8.9
Robust-ORE     98.5 ± 0.7   99.6 ± 0.2   99.7 ± 0.1   99.4 ± 0.5   98.3 ± 1.0   93.8 ± 4.6

Table 2: The comparison of accuracy on the occluded Yale-B. The highest recognition rates are shown in 
bold. Robust-ORE represents the ORE-model with Robust-BPRs. Note that the original LRC is performed 
with 400-D Randomfaces. 





               Expressions   Sunglasses    Scarves
LRC (400-D)    81.0          54.5          10.7
DEF            88.2          91.2          85.2
Block-SRC      87.5          95.7          86.0
ORE            82.0 ± 1.2    85.0 ± 3.9    86.5 ± 0.7
Robust-ORE     92.8 ± 0.9    96.1 ± 1.8    95.8 ± 1.2

Table 3: The comparison of accuracy on the AR dataset. The highest recognition rates are shown in bold. 
Robust-ORE represents the ORE-model with Robust-BPRs. Note that the original LRC is performed with 
400-D Randomfaces. 

                     25-D         50-D         100-D        200-D        400-D
LDA   NN             95.3 ± 0.3   97.4 ± 0.6   97.9 ± 0.5   -            -
LDA   NFL            92.5 ± 0.8   96.8 ± 0.4   97.9 ± 0.2   -            -
LDA   SRC            94.6 ± 0.5   97.4 ± 0.5   97.9 ± 0.4   -            -
LDA   LRC            72.8 ± 1.6   94.5 ± 0.4   97.1 ± 0.3   -            -
Rand  NN             17.0 ± 0.9   19.8 ± 1.6   22.8 ± 2.3   22.2 ± 1.5   23.6 ± 1.1
Rand  NFL            44.9 ± 3.0   55.2 ± 1.9   60.9 ± 1.7   63.1 ± 1.2   65.1 ± 1.2
Rand  SRC            45.4 ± 0.5   71.3 ± 1.7   85.8 ± 1.1   91.5 ± 0.7   93.9 ± 0.6
Rand  LRC            43.0 ± 2.2   71.6 ± 1.8   78.9 ± 1.3   82.1 ± 1.2   83.5 ± 0.7
PCA   NN             19.4 ± 1.3   20.4 ± 1.1   21.7 ± 1.3   21.8 ± 1.2   22.0 ± 1.0
PCA   NFL            41.9 ± 1.6   48.2 ± 1.2   52.1 ± 1.6   54.3 ± 1.3   55.4 ± 1.3
PCA   SRC            52.7 ± 0.8   72.1 ± 1.5   80.8 ± 1.0   83.6 ± 0.5   83.9 ± 0.7
PCA   LRC            60.3 ± 0.8   75.3 ± 1.0   80.3 ± 0.7   82.1 ± 0.8   82.7 ± 0.8
      ORE            97.0 ± 0.5   98.7 ± 0.5   99.0 ± 0.3   99.1 ± 0.1   -
      Robust-ORE     98.4 ± 0.5   99.1 ± 0.4   99.4 ± 0.2   99.5 ± 0.2   -
      Boosted-ORE    96.8 ± 0.3   98.6 ± 0.3   98.9 ± 0.3   99.0 ± 0.4   -

Table 4: The comparison of accuracy on AR. The highest recognition rates are shown in bold. Note that we 
only perform algorithms with the Fisherface (LDA) on the 25-D, 50-D and 100-D feature spaces, where the 
100-D results are actually obtained with 95-D features. The original patch 
has 225 pixels, thus we can't conduct ORE algorithms with 400-D features. 